Training a Chinese Model with word2vec
1. Preparing the data and preprocessing
First you need a fairly large Chinese corpus; Chinese Wikipedia is a good choice (the Sogou news corpus is also worth trying). The packaged Chinese Wikipedia dump can be downloaded from the Wikimedia dumps site, and the data is not too large: the compressed XML file is roughly 1 GB. First use process_wiki_data.py to convert this XML archive to plain text by running:

python process_wiki_data.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# process_wiki_data.py: parse the XML dump and convert the wiki data to plain text

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    # write each article as one line of space-separated tokens
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Running it produces output like the following:
2016-08-11 20:39:22,739: INFO: running process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
2016-08-11 20:40:08,329: INFO: Saved 10000 articles
2016-08-11 20:40:45,501: INFO: Saved 20000 articles
2016-08-11 20:41:23,659: INFO: Saved 30000 articles
2016-08-11 20:42:01,748: INFO: Saved 40000 articles
2016-08-11 20:42:33,779: INFO: Saved 50000 articles
......
2016-08-11 20:55:23,094: INFO: Saved 200000 articles
2016-08-11 20:56:14,692: INFO: Saved 210000 articles
2016-08-11 20:57:04,614: INFO: Saved 220000 articles
2016-08-11 20:57:57,979: INFO: Saved 230000 articles
2016-08-11 20:58:16,621: INFO: finished iterating over Wikipedia corpus of 232894 documents with 51603419 positions (total 2581444 articles, 62177405 positions before pruning articles shorter than 50 words)
2016-08-11 20:58:16,622: INFO: Finished Saved 232894 articles

In Python, the word segmentation step can be done with jieba, producing the segmented file wiki.zh.text.seg; a minimal sketch of this step is shown below.
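The segmentation script itself is not given in the original post, so the following is only a minimal sketch of that step, assuming a hypothetical helper script (here called seg_wiki.py) that reads wiki.zh.text line by line, segments each article with jieba, and writes the result to wiki.zh.text.seg:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# seg_wiki.py (hypothetical helper, not part of the original post):
# segment wiki.zh.text with jieba and write wiki.zh.text.seg,
# keeping one space-separated article per line
import codecs
import jieba

if __name__ == '__main__':
    with codecs.open('wiki.zh.text', 'r', encoding='utf-8') as reader, \
            codecs.open('wiki.zh.text.seg', 'w', encoding='utf-8') as writer:
        for line in reader:
            # jieba.cut returns an iterator of unicode tokens
            words = jieba.cut(line.strip())
            writer.write(' '.join(words) + '\n')

With wiki.zh.text.seg in place, train the model with the word2vec tool: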
python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# train_word2vec_model.py: train a word2vec model on the segmented corpus

import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # 400-dimensional vectors, context window of 5, drop words seen fewer
    # than 5 times, one worker per CPU core
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    # model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)

The run produces log output like the following:
2016-08-12 09:50:02,586: INFO: running python train_word2vec_model.py wiki.zh.text.seg wiki.zh.text.model wiki.zh.text.vector
2016-08-12 09:50:02,592: INFO: collecting all words and their counts
2016-08-12 09:50:02,592: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2016-08-12 09:50:12,476: INFO: PROGRESS: at sentence #10000, processed 12914562 words and 254662 word types
2016-08-12 09:50:20,215: INFO: PROGRESS: at sentence #20000, processed 22308801 words and 373573 word types
2016-08-12 09:50:28,448: INFO: PROGRESS: at sentence #30000, processed 30724902 words and 460837 word types
...
2016-08-12 09:52:03,498: INFO: PROGRESS: at sentence #210000, processed 143804601 words and 1483608 word types
2016-08-12 09:52:07,772: INFO: PROGRESS: at sentence #220000, processed 149352283 words and 1521199 word types
2016-08-12 09:52:11,639: INFO: PROGRESS: at sentence #230000, processed 154741839 words and 1563584 word types
2016-08-12 09:52:12,746: INFO: collected 1575172 word types from a corpus of 156430908 words and 232894 sentences
2016-08-12 09:52:13,672: INFO: total 278291 word types after removing those with count<5
2016-08-12 09:52:13,673: INFO: constructing a huffman tree from 278291 words
2016-08-12 09:52:29,323: INFO: built huffman tree with maximum node depth 25
2016-08-12 09:52:29,683: INFO: resetting layer weights
2016-08-12 09:52:38,805: INFO: training model with 4 workers on 278291 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2016-08-12 09:52:49,504: INFO: PROGRESS: at 0.10% words, alpha 0.02500, 15008 words/s
2016-08-12 09:52:51,935: INFO: PROGRESS: at 0.38% words, alpha 0.02500, 44434 words/s
2016-08-12 09:52:54,779: INFO: PROGRESS: at 0.56% words, alpha 0.02500, 53965 words/s
2016-08-12 09:52:57,240: INFO: PROGRESS: at 0.62% words, alpha 0.02491, 52116 words/s
2016-08-12 09:52:58,823: INFO: PROGRESS: at 0.72% words, alpha 0.02494, 55804 words/s
2016-08-12 09:53:03,649: INFO: PROGRESS: at 0.94% words, alpha 0.02486, 58277 words/s
2016-08-12 09:53:07,357: INFO: PROGRESS: at 1.03% words, alpha 0.02479, 56036 words/s
......
2016-08-12 19:22:09,002: INFO: PROGRESS: at 98.38% words, alpha 0.00044, 85936 words/s
2016-08-12 19:22:10,321: INFO: PROGRESS: at 98.50% words, alpha 0.00044, 85971 words/s
2016-08-12 19:22:11,934: INFO: PROGRESS: at 98.55% words, alpha 0.00039, 85940 words/s
2016-08-12 19:22:13,384: INFO: PROGRESS: at 98.65% words, alpha 0.00036, 85960 words/s
2016-08-12 19:22:13,883: INFO: training on 152625573 words took 1775.1s, 85982 words/s
2016-08-12 19:22:13,883: INFO: saving Word2Vec object under wiki.zh.text.model, separately None
2016-08-12 19:22:13,884: INFO: not storing attribute syn0norm
2016-08-12 19:22:13,884: INFO: storing numpy array 'syn0' to wiki.zh.text.model.syn0.npy
2016-08-12 19:22:20,797: INFO: storing numpy array 'syn1' to wiki.zh.text.model.syn1.npy
2016-08-12 19:22:40,667: INFO: storing 278291x400 projection weights into wiki.zh.text.vector

Testing the model:
In [1]: import gensim

In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")

In [3]: model.most_similar(u"足球")
Out[3]:
[(u'联赛', 0.6553816199302673),
 (u'甲级', 0.6530429720878601),
 (u'篮球', 0.5967546701431274),
 (u'俱乐部', 0.5872289538383484),
 (u'乙级', 0.5840631723403931),
 (u'足球队', 0.5560152530670166),
 (u'亚足联', 0.5308005809783936),
 (u'allsvenskan', 0.5249762535095215),
 (u'代表队', 0.5214947462081909),
 (u'甲组', 0.5177896022796631)]

In [4]: result = model.most_similar(u"足球")

In [5]: for e in result:
   ....:     print e[0], e[1]
   ....:
联赛 0.65538161993
甲级 0.653042972088
篮球 0.596754670143
俱乐部 0.587228953838
乙级 0.58406317234
足球队 0.556015253067
亚足联 0.530800580978
allsvenskan 0.52497625351
代表队 0.521494746208
甲组 0.51778960228

In [6]: result = model.most_similar(u"男人")

In [7]: for e in result:
   ....:     print e[0], e[1]
   ....:
女人 0.77537125349
家伙 0.617369174957
妈妈 0.567102909088
漂亮 0.560832381248
 0.540875017643
谎言 0.538448691368
爸爸 0.53660941124
傻瓜 0.535608053207
 0.535151124001
mc刘 0.529670000076

In [8]: result = model.most_similar(u"女人")

In [9]: for e in result:
   ....:     print e[0], e[1]
   ....:
男人 0.77537125349
我的某 0.589010596275
妈妈 0.576344847679
 0.562340974808
美丽 0.555426716805
爸爸 0.543958246708
新娘 0.543640494347
谎言 0.540272831917
妞儿 0.531066179276
老婆 0.528521537781

In [10]: result = model.most_similar(u"青蛙")

In [11]: for e in result:
   ....:     print e[0], e[1]
   ....:
老鼠 0.559612870216
乌龟 0.489831030369
蜥蜴 0.478990525007
猫 0.46728849411
鳄鱼 0.461885392666
蟾蜍 0.448014199734
猴子 0.436584025621
白雪公主 0.434905380011
蚯蚓 0.433413207531
螃蟹 0.4314712286

In [12]: result = model.most_similar(u"姨夫")

In [13]: for e in result:
   ....:     print e[0], e[1]
   ....:
堂伯 0.583935439587
祖父 0.574735701084
妃所生 0.569327116013
内弟 0.562012672424
早卒 0.558042645454
曕 0.553856015205
胤祯 0.553288519382
陈潜 0.550716996193
愔之 0.550510883331
叔父 0.550032019615

In [14]: result = model.most_similar(u"衣服")

In [15]: for e in result:
   ....:     print e[0], e[1]
   ....:
鞋子 0.686688780785
穿着 0.672499775887
衣物 0.67173999548
大衣 0.667605519295
裤子 0.662670075893
内裤 0.662210345268
裙子 0.659705817699
西装 0.648508131504
洋装 0.647238850594
围裙 0.642895817757

In [16]: result = model.most_similar(u"公安局")

In [17]: for e in result:
   ....:     print e[0], e[1]
   ....:
司法局 0.730189085007
公安厅 0.634275555611
公安 0.612798035145
房管局 0.597343325615
商业局 0.597183346748
军管会 0.59476184845
体育局 0.59283208847
财政局 0.588721752167
戒毒所 0.575558543205
新闻办 0.573395550251

In [18]: result = model.most_similar(u"铁道部")

In [19]: for e in result:
   ....:     print e[0], e[1]
   ....:
盛光祖 0.565509021282
交通部 0.548688530922
批复 0.546967327595
刘志军 0.541010737419
立项 0.517836689949
报送 0.510296344757
计委 0.508456230164
水利部 0.503531932831
国务院 0.503227233887
经贸委 0.50156635046

In [20]: result = model.most_similar(u"清华大学")

In [21]: for e in result:
   ....:     print e[0], e[1]
   ....:
北京大学 0.763922810555
化学系 0.724210739136
物理系 0.694550514221
数学系 0.684280991554
中山大学 0.677202701569
复旦 0.657914161682
师范大学 0.656435549259
哲学系 0.654701948166
生物系 0.654403865337
中文系 0.653147578239

In [22]: result = model.most_similar(u"卫视")

In [23]: for e in result:
   ....:     print e[0], e[1]
   ....:
湖南 0.676812887192
中文台 0.626506924629
収蔵 0.621356606483
黄金档 0.582251906395
cctv 0.536769032478
安徽 0.536752820015
非同凡响 0.534517168999
唱响 0.533438682556
最强音 0.532605051994
金鹰 0.531676828861

In [26]: result = model.most_similar(u"林丹")

In [27]: for e in result:
   ....:     print e[0], e[1]
   ....:
黄综翰 0.538035452366
蒋燕皎 0.52646958828
刘鑫 0.522252976894
韩晶娜 0.516120731831
王晓理 0.512289524078
王适 0.508560419083
杨影 0.508159279823
陈跃 0.507353425026
龚智超 0.503159761429
李敬元 0.50262516737

In [28]: result = model.most_similar(u"语言学")

In [29]: for e in result:
   ....:     print e[0], e[1]
   ....:
社会学 0.632598280907
人类学 0.623406708241
历史学 0.618442356586
比较文学 0.604823827744
心理学 0.600066184998
人文科学 0.577783346176
社会心理学 0.575571238995
政治学 0.574541330338
地理学 0.573896467686
哲学 0.573873817921

In [30]: result = model.most_similar(u"计算机")

In [31]: for e in result:
   ....:     print e[0], e[1]
   ....:
自动化 0.674171924591
应用 0.614087462425
自动化系 0.611132860184
材料科学 0.607891201973
集成电路 0.600370049477
技术 0.597518980503
电子学 0.591316461563
建模 0.577238917351
工程学 0.572855889797
微电子 0.570086717606

In [32]: model.similarity(u"计算机", u"自动化")
Out[32]: 0.67417196002404789

In [33]: model.similarity(u"女人", u"男人")
Out[33]: 0.77537125129824813

In [34]: model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
Out[34]: u'中心'

In [35]: print model.doesnt_match(u"早餐 晚餐 午餐 中心".split())
中心
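Besides the binary .model files, training also wrote the vectors in the plain-text word2vec format (wiki.zh.text.vector). As a minimal sketch, assuming the older gensim API used throughout this post, that file can be loaded and queried on its own; on gensim 1.0 and later the equivalent loader is gensim.models.KeyedVectors.load_word2vec_format:

# -*- coding: utf-8 -*-
# Load the text-format vectors saved as wiki.zh.text.vector and query them.
# Word2Vec.load_word2vec_format matches the gensim version used above;
# on newer gensim use gensim.models.KeyedVectors.load_word2vec_format instead.
import gensim

vectors = gensim.models.Word2Vec.load_word2vec_format(
    "wiki.zh.text.vector", binary=False)

for word, score in vectors.most_similar(u"足球", topn=5):
    print word, score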