Running Google's open-source word2vec project on Chinese data

http://www.cnblogs.com/hebin/p/3507609.html

I had long heard that word2vec works very well on word-to-word similarity, so I recently tried running Google's open-source code (https://code.google.com/p/word2vec/) myself.

1. Corpus

First, prepare the data. I used the full-web news dataset (SogouCA) recommended on several blogs; the download is 2.1 GB.

Download the archive SogouCA.tar.gz from the FTP server:

wget ftp://ftp.labs.sogou.com/Data/SogouCA/SogouCA.tar.gz --ftp-user=hebin_hit@foxmail.com --ftp-password=4FqLSYdNcrDXvNDi -r

Unpack the archive:

gzip -d SogouCA.tar.gz
tar -xvf SogouCA.tar

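Next, merge the extracted .txt files into SogouCA.txt, convert the text from GBK to UTF-8, and keep only the lines containing the <content> tag. The result is the corpus file corpus.txt, 2.7 GB in size.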

cat *.txt > SogouCA.txt
cat SogouCA.txt | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt
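The same filtering step can be sketched in Python. This is only a minimal illustration, not the method used in this post: the helper name extract_content_lines and the sample records are hypothetical, and it assumes one record per line with the article text wrapped in <content> tags, as the grep pattern suggests. Undecodable bytes are dropped, roughly like iconv -c.

```python
def extract_content_lines(raw_lines):
    """Decode GBK byte lines and keep those that contain a <content> tag.

    Bytes that cannot be decoded are silently dropped, similar to `iconv -c`.
    """
    kept = []
    for raw in raw_lines:
        text = raw.decode("gbk", errors="ignore")
        if "<content>" in text:
            kept.append(text.rstrip("\n"))
    return kept


# Hypothetical sample records (not real SogouCA data).
sample = [
    "<url>http://example.com/1</url>\n".encode("gbk"),
    "<content>今天天气很好</content>\n".encode("gbk"),
    "<contenttitle>标题</contenttitle>\n".encode("gbk"),
]
corpus_lines = extract_content_lines(sample)
print(corpus_lines)
```

Note that a line tagged <contenttitle> is not kept, because the pattern includes the closing angle bracket.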

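2. Word segmentation

Segment corpus.txt with ANSJ to produce resultbig.txt, 3.1 GB in size.

For the ANSJ segmentation tool, see http://blog.csdn.net/zhaoxinfan/article/details/10403917

In the seg_tool directory of the segmenter, compile first and then run it to obtain resultbig.txt, which contains 426,221 distinct words and 572,308,385 word occurrences in total.

Segmentation result: [figure omitted]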
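Counts like these can be checked with a few lines of Python. The sketch below assumes the segmented file is plain text with words separated by whitespace; count_tokens and the two sample sentences are hypothetical. word2vec also prints its own vocabulary and word counts when it reads the training file, and it discards words below its minimum count, so a raw count may not match its figures exactly.

```python
from collections import Counter


def count_tokens(lines):
    """Return (number of distinct words, total word occurrences)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return len(counts), sum(counts.values())


# Hypothetical segmented sentences; real input would be the lines of resultbig.txt.
segmented = ["我 爱 北京", "我 爱 上海"]
vocab_size, total_tokens = count_tokens(segmented)
print(vocab_size, total_tokens)
```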

3. Training word vectors with word2vec

nohup ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 &

vectors.bin is the word-vector file that word2vec produces from resultbig.txt. Training took about an hour and a half on our lab server.
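To inspect such a file outside the bundled tools, it helps to know its layout. As far as I understand the -binary 1 output, it is a text header line with the vocabulary size and vector dimension, followed by one entry per word: the word, a space, the raw 32-bit floats, and a newline. The sketch below assumes little-endian floats (as written on x86) and UTF-8 words; read_word2vec_binary is a hypothetical helper, and the demo bytes are built by hand rather than taken from a real model.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Parse word2vec-style binary output into a {word: [floats]} dict."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # Each entry: word bytes, one space, `dim` raw float32 values, newline.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline left over from the previous entry
                chars.extend(ch)
        values = struct.unpack("<%df" % dim, stream.read(4 * dim))
        vectors[chars.decode("utf-8")] = list(values)
    return vectors


# Hand-built two-word file with 3-dimensional vectors (illustrative values).
blob = b"2 3\n"
blob += "北京".encode("utf-8") + b" " + struct.pack("<3f", 1.0, 0.5, -2.0) + b"\n"
blob += "上海".encode("utf-8") + b" " + struct.pack("<3f", 0.25, 0.0, 4.0) + b"\n"
vectors = read_word2vec_binary(io.BytesIO(blob))
print(vectors)
```

For the real file, the stream would be open("vectors.bin", "rb"); with a vocabulary of several hundred thousand words, a NumPy-based reader would be a better fit than Python lists.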

4. Analysis

4.1 Finding similar words:

./distance vectors.bin

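./distance can be thought of as measuring how close words are to each other: each word is a point in the vector space, and closeness is measured between those points. The tool ranks neighbors by cosine similarity, so a higher score means a closer word.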
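The ranking idea can be sketched in a few lines. This is an illustration with invented 2-dimensional vectors, not the tool's implementation; cosine_similarity and nearest are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`, best first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors invented for illustration only.
toy = {
    "北京": [1.0, 0.0],
    "上海": [0.8, 0.6],
    "苹果": [0.0, 1.0],
}
ranking = nearest("北京", toy)
print(ranking)
```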

Some examples: [figures omitted]

4.2 Latent linguistic regularities

After modifying demo-analogy.sh, I obtained examples such as the following:

The capital of France (法国) is Paris (巴黎) and the capital of the UK (英国) is London (伦敦): vector("巴黎") - vector("法国") + vector("英国") --> vector("伦敦")
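The arithmetic behind such analogies can be sketched as follows. To my understanding, the word-analogy tool takes three words A, B, C and looks for the word whose vector is closest to vector(B) - vector(A) + vector(C), excluding the three inputs. The toy 3-dimensional vectors below are invented so that the offset between a country and its capital is the same for both pairs; analogy and cosine are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors invented for illustration only.
toy = {
    "法国": [1.0, 0.0, 0.0],
    "巴黎": [1.0, 1.0, 0.0],
    "英国": [0.0, 0.0, 1.0],
    "伦敦": [0.0, 1.0, 1.0],
    "苹果": [0.5, -1.0, 0.2],
}
answer = analogy("法国", "巴黎", "英国", toy)
print(answer)
```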

4.3 Clustering

Cluster the words in the segmented corpus resultbig.txt, then sort them by class:

nohup ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500 &
sort classes.txt -k 2 -n > classes_sorted_sogouca.txt
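With -classes 500, word2vec runs k-means on the learned vectors and writes one "word class_id" pair per line, which is why sorting on the second column groups words of the same class together. A minimal Python sketch of the same grouping, assuming that two-column layout (group_by_class and the sample lines are hypothetical):

```python
from collections import defaultdict


def group_by_class(lines):
    """Map each class id to the list of words assigned to it."""
    groups = defaultdict(list)
    for line in lines:
        word, class_id = line.split()
        groups[int(class_id)].append(word)
    return dict(groups)


# Hypothetical lines in the "word class_id" layout of classes.txt.
sample = ["北京 3", "苹果 7", "上海 3", "香蕉 7", "足球 1"]
clusters = group_by_class(sample)
for class_id in sorted(clusters):
    print(class_id, " ".join(clusters[class_id]))
```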

For example: [figure omitted]

4.4 Phrase analysis

First, build sogouca_phrase.txt, a file containing both words and phrases, from the segmented corpus resultbig.txt. Then train vector representations for the words and phrases in that file.

./word2phrase -train resultbig.txt -output sogouca_phrase.txt -threshold 500 -debug 2
./word2vec -train sogouca_phrase.txt -output vectors_sogouca_phrase.bin -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
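The role of -threshold can be illustrated with a simplified version of the phrase score. As I understand it, word2phrase scores each adjacent word pair as (count(a b) - min_count) / (count(a) * count(b)), scaled by the total number of training words, and joins the pair with an underscore when the score exceeds the threshold. The sketch below is a simplification under that assumption: it scores all bigrams at once instead of streaming through the text, and find_phrases, the toy token list, and the small min_count and threshold values are hypothetical.

```python
from collections import Counter


def find_phrases(tokens, threshold, min_count=5):
    """Return {"a_b": score} for bigrams whose score exceeds `threshold`.

    score = (count(a b) - min_count) / (count(a) * count(b)) * total_tokens
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    phrases = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * total
        if score > threshold:
            phrases[a + "_" + b] = score
    return phrases


# Toy token stream: "新 西兰" always occur together, "的" is a frequent filler.
tokens = ["新", "西兰", "的", "的", "的"] * 6
phrases = find_phrases(tokens, threshold=2, min_count=2)
print(phrases)
```

Pairs that co-occur often relative to how common each word is on its own score highest, so a frequent function word such as 的 does not form phrases with its neighbors.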

Some examples of the resulting similarity queries: [figures omitted]

5. Reference links:

1. word2vec: Tool for computing continuous distributed representations of words, https://code.google.com/p/word2vec/

2. Playing with Google's open-source deep-learning project word2vec on Chinese text, http://www.cnblogs.com/wowarsenal/p/3293586.html

3. Clustering keywords with word2vec, http://blog.csdn.net/zhaoxinfan/article/details/11069485

6. Papers I plan to read carefully next:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
[4] Ronan Collobert, Jason Weston, Léon Bottou, et al. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.

Posted on 2016-01-13 13:49 by SIMONE
