Hi, I’m pretty new to machine learning, word2vec, and this group, so my apologies if this topic is out of place or in the wrong board. This semester my professor has asked me to investigate word2vec, by Tomas Mikolov and his team at Google, particularly with regard to machine translation. For this task I’m using the word2vec implementation in the gensim package for Python.

In the paper linked below, Mikolov describes how, after training two monolingual models, they learn a translation matrix from the 5000 most frequently occurring words and their dictionary translations, and then use this matrix to evaluate translation accuracy on the next 1000 words.

Paper: http://arxiv.org/pdf/1309.4168.pdf

Here are two screencaps, one of the description of the matrix in the paper and one of some clarification Mikolov posted on the word2vec board.

I have been experimenting quite a bit with the models I have generated for Japanese and English in gensim. I downloaded Wikipedia dumps, processed and tokenised them, and trained the models with gensim. I would like to replicate what Mikolov did and see how accurate the translations are for Japanese/English (my two languages). I am unsure how to get the top 6000 words (5000 to build the translation matrix, 1000 for testing), and especially how to produce the matrix itself. I have read the papers and seen the algorithms but can’t quite put them into code.

If anyone has ideas or suggestions on how to do this, can provide some pseudocode, or has gensim knowledge and can lend a hand, it would be greatly appreciated. I’m very motivated for this task but am having difficulty progressing.

There is also a topic (linked below) about this same thing on the word2vec group. I tried to adopt the process from the Python scripts in the bottom posts, but to no avail: running one just caused my PC to hang, and the other executed but I don’t understand what it is doing.

Source.
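To make the question concrete, my current understanding is that gensim keeps its vocabulary sorted by descending corpus frequency, so the top 6000 words should just be the first 6000 vocabulary keys (please correct me if that’s wrong). Here is a minimal sketch of that step; the mock class is only a stand-in for a real trained model, which would be loaded with something like gensim.models.Word2Vec.load(...):

```python
class MockKeyedVectors:
    """Stand-in for gensim's KeyedVectors, used here so the sketch runs
    without a trained model.  A real model's .wv would be used instead."""
    def __init__(self, keys):
        # gensim >= 4 exposes the frequency-sorted vocabulary as
        # index_to_key; gensim 3.x called it index2word.
        self.index_to_key = keys

def top_words(wv, n):
    # The vocabulary is ordered by descending frequency, so the n most
    # frequent words are simply the first n entries.
    keys = getattr(wv, "index_to_key", None)
    if keys is None:
        keys = wv.index2word  # fall back for older gensim versions
    return keys[:n]

# Split as in the paper: first 5000 words to fit the translation
# matrix, the next 1000 to evaluate it.
wv = MockKeyedVectors(["word%d" % i for i in range(6000)])
train_words = top_words(wv, 5000)
test_words = top_words(wv, 6000)[5000:]
```

Does that match how gensim actually orders its vocab, or do I need to sort by the stored counts myself?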
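And for the translation matrix itself: the paper learns W by minimising sum_i ||W x_i - z_i||^2, which I believe is plain least squares, so numpy’s lstsq should solve it in one call. Below is a toy sketch with random vectors standing in for the real word2vec vectors; in the real setup X would hold the source-model vectors of the 5000 most frequent words and Z the target-model vectors of their dictionary translations. Is this roughly right?

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, dim = 50, 8

# Toy data standing in for word2vec vectors: random source vectors X,
# and "translations" Z related to them by a known matrix plus noise.
true_W = rng.normal(size=(dim, dim))
X = rng.normal(size=(n_pairs, dim))
Z = X @ true_W + 0.01 * rng.normal(size=(n_pairs, dim))

# Stacking the word pairs row-wise turns min_W sum_i ||W x_i - z_i||^2
# into the ordinary least-squares problem X W ~= Z.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translate(x, target_vecs):
    """Map a source vector into the target space with W and return the
    index of the nearest candidate by cosine similarity."""
    z = x @ W
    sims = (target_vecs @ z) / (
        np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(z)
    )
    return int(np.argmax(sims))
```

For the evaluation on the next 1000 words I would then translate each test word’s vector and check whether its nearest neighbour among the target-model vectors is the dictionary translation (the paper reports precision at 1 and at 5). Am I missing anything in this pipeline?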