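Natural Language Processing: Sentiment Classification of Online Shopping Reviews

Contents

1. Project Background

2. Dataset

3. Data Preprocessing

4. SVM-Based Sentiment Classification Model

5. Unsupervised Classification Model Based on doc2vec (from word2vec)


Natural language processing (NLP) gives companies and developers core tools for text analysis and mining, helping them process text efficiently. It is widely used in e-commerce, entertainment, the judiciary, public security, finance, healthcare, electric power, and other industries, with good results.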

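1. Project Background

In any industry, user feedback on a product matters a great deal. User reviews make it possible to determine the sentiment of the people who wrote them.

Online shopping is the most common example. For shoppers, reading reviews leads to better purchase decisions. For sellers, classifying reviews by sentiment and clustering the text to find the most frequently mentioned strengths and weaknesses helps improve the product.

This case study focuses on determining the sentiment of product reviews. The figure below shows some reviews of one phone model on an e-commerce platform: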

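2. Dataset

The dataset contains reviews of one phone model. It has 2 attributes (Comment and Class) and 8,187 samples in total.

We read the .xls file with the read_excel function from pandas. The file is encoded in gb18030, but read_excel in current pandas versions takes no encoding argument (the old encoding='gb18030' setting is no longer accepted), so none is passed:

    import pandas as pd
    # Load the dataset (reading .xls files requires the xlrd package)
    data = pd.read_excel("data.xls")
    print(data.head())

A sample of the loaded dataset is shown below: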

Next, we inspect the dataset: its shape, its column names, and the number of samples in each class:

    # Size of the dataset
    print(data.shape)
    # Column names
    print(data.columns.values)
    # Number of records per class
    print(data['Class'].value_counts())

The output is:

    (8186, 2)
    array([u'Comment', u'Class'], dtype=object)
     1    3042
    -1    2657
     0    2487
    Name: Class, dtype: int64

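3. Data Preprocessing

We now need to turn the text in the Comment column into a numeric matrix, that is, map the text into a feature space.

First, we segment the Chinese text with jieba, which uses an HMM model to recognize words missing from its dictionary. The imports are:

    # Import the Chinese word segmentation library jieba
    import jieba
    import numpy as np

Next, we segment the text of every sample. A missing value is replaced with "还行 一般吧" ("it's okay, so-so"):

    cutted = []
    for row in data.values:
        try:
            # Join the tokens with spaces so they can be split again later
            raw_words = " ".join(jieba.cut(row[0]))
            cutted.append(raw_words)
        except AttributeError:
            # A missing comment is read as NaN (a float), which jieba cannot cut
            print(row[0])
            cutted.append(u"还行 一般吧")
    cutted_array = np.array(cutted)
    # Build a new DataFrame whose Comment field holds the segmented text
    data_cutted = pd.DataFrame({
        'Comment': cutted_array,
        'Class': data['Class']
    })

We then look at the preprocessed data:

    print(data_cutted.head())

A sample of the result is shown below: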

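To see the high-frequency words more directly, we visualize the text with the third-party library wordcloud. The imports are:

    # Import the third-party library wordcloud
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

We build a WordCloud object for the positive, neutral, and negative reviews in turn and draw a word cloud for each. The code for the positive reviews is:

    # Positive reviews
    # font_path must point to a font file that contains Chinese glyphs
    wc = WordCloud(font_path='Courier.ttf')
    # Join reviews with a space so adjacent reviews do not merge into one token
    wc.generate(' '.join(data_cutted['Comment'][data_cutted['Class'] == 1]))
    plt.axis('off')
    plt.imshow(wc)
    plt.show()

The word cloud for the positive reviews is shown below:

The code for the neutral reviews is:

    # Neutral reviews
    wc = WordCloud(font_path='Courier.ttf')
    wc.generate(' '.join(data_cutted['Comment'][data_cutted['Class'] == 0]))
    plt.axis('off')
    plt.imshow(wc)
    plt.show()

The word cloud for the neutral reviews is shown below:

The code for the negative reviews is:

    # Negative reviews
    wc = WordCloud(font_path='Courier.ttf')
    wc.generate(' '.join(data_cutted['Comment'][data_cutted['Class'] == -1]))
    plt.axis('off')
    plt.imshow(wc)
    plt.show()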

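The word cloud for the negative reviews is shown below:

The word clouds show that frequent words such as "手机" (phone), "就是" (just), "屏幕" (screen), and "收到" (received) do nothing to separate the classes and can bias the model. Words like these, which carry no information for classification, are collected in the stop-word file stopwords.txt. The following code reads that file: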

    # Read the stop-word file
    with open('stopwords.txt', 'r', encoding='utf-8') as f:
        stopwords = [item.strip() for item in f]
    # Print the first 200 stop words on one line
    for item in stopwords[0:200]:
        print(item, end=' ')

The printed stop words are shown below:

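We use the extract_tags function from jieba to find the top 20 keywords in the positive, neutral, and negative reviews.

    # Set the stop-word file so that stop words are filtered out of the keyword statistics
    import jieba.analyse
    jieba.analyse.set_stop_words('stopwords.txt')

The code for the positive reviews is:

    # Keywords of the positive reviews
    keywords_pos = jieba.analyse.extract_tags(
        ' '.join(data_cutted['Comment'][data_cutted['Class'] == 1]), topK=20)
    for item in keywords_pos:
        print(item, end=' ')

The top 20 keywords of the positive reviews are: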

不错 正品 赠品 五分 发货 东西 满意 机子 喜欢 收到 很漂亮 充电 好评 很快 卖家 速度 评价 流畅 快递 物流

The code for the neutral reviews is:

    # Keywords of the neutral reviews
    keywords_med = jieba.analyse.extract_tags(
        ' '.join(data_cutted['Comment'][data_cutted['Class'] == 0]), topK=20)
    for item in keywords_med:
        print(item, end=' ')

The top 20 keywords of the neutral reviews are:

充电 不错 发热 外观 感觉 电池 机子 问题 赠品 有点 无线 发烫 换货 软件 快递 安卓 内存 退货 知道 售后

The code for the negative reviews is:

    # Keywords of the negative reviews
    keywords_neg = jieba.analyse.extract_tags(
        ' '.join(data_cutted['Comment'][data_cutted['Class'] == -1]), topK=20)
    for item in keywords_neg:
        print(item, end=' ')

The top 20 keywords of the negative reviews are:

差评 售后 垃圾 赠品 退货 问题 换货 充电 降价 发票 充电器 东西 刚买 发热 无线 机子 死机 收到 质量 15
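
To make the role of the stop-word list concrete, here is a minimal sketch in plain Python. It is not jieba's extract_tags, which ranks terms by TF-IDF; it only counts raw frequencies. The toy reviews, the stop-word set, and the helper name top_terms are made up for illustration.

```python
from collections import Counter

# Toy segmented reviews (space-joined tokens); illustrative data only
toy_reviews = [
    "手机 不错 屏幕 很 清晰",
    "手机 发热 充电 很 慢",
    "手机 不错 充电 快",
]
# Hypothetical stop-word set; the article reads its list from stopwords.txt
toy_stopwords = {"手机", "屏幕", "很"}


def top_terms(reviews, stop_words, k):
    """Return the k most frequent tokens that are not stop words."""
    counts = Counter(
        token
        for review in reviews
        for token in review.split(" ")
        if token not in stop_words
    )
    return counts.most_common(k)


print(top_terms(toy_reviews, toy_stopwords, 2))
```

Without the filter, "手机" would top the list even though it appears in every review and says nothing about sentiment.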

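With these steps, preprocessing of the dataset is complete. In Chinese text analysis and sentiment analysis, preprocessing consists mainly of word segmentation: only segmented text can be vectorized in the next step and then fed to a model.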

4. SVM-Based Sentiment Classification Model

The segmented text must be vectorized before it can be fed into a classification model.
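
As a rough illustration of what vectorization means here, the sketch below turns space-joined tokens into term-count vectors in plain Python. The sample texts and the function name count_vectors are assumptions for the example; sklearn's vectorizers additionally handle n-grams, stop words, and sparse storage.

```python
def count_vectors(docs):
    """Map space-joined token strings to term-count vectors."""
    tokenized = [doc.split(" ") for doc in docs]
    # A sorted vocabulary gives every term a fixed column index
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            row[index[tok]] += 1
        vectors.append(row)
    return vocab, vectors


# Toy segmented comments (illustrative only)
docs = ["不错 不错 喜欢", "发热 退货", "喜欢 发热"]
vocab, vectors = count_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", vec)
```

Each comment becomes one row and each distinct term one column, which is the matrix form a classifier expects.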

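We use sklearn to vectorize the text: stop words are removed and the text is mapped into the feature space with tf-idf weighting.

In the tf-idf weighting, tf is the term frequency, that is, the number of times a term appears in a given comment after segmentation; df is the number of comments that contain the term; and N is the total number of comments. Logarithms are used to damp the influence of the tf and df values.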
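
The exact formula is not reproduced here. One common weighting that fits the description (a logarithm damping both tf and df) is w = (1 + log tf) × log(N / df). Treat that form as an assumption: it differs in detail from the defaults of sklearn's TfidfVectorizer, which add smoothing and L2 normalization. A minimal sketch on toy data, with the function name tfidf_weights chosen for illustration:

```python
import math


def tfidf_weights(docs):
    """Weight each term as (1 + log tf) * log(N / df); assumed form."""
    tokenized = [doc.split(" ") for doc in docs]
    n_docs = len(tokenized)
    # df: number of documents that contain the term
    df = {}
    for toks in tokenized:
        for term in set(toks):
            df[term] = df.get(term, 0) + 1
    weights = []
    for toks in tokenized:
        # tf: number of occurrences of the term in this document
        tf = {}
        for term in toks:
            tf[term] = tf.get(term, 0) + 1
        weights.append({
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Toy segmented comments (illustrative only)
docs = ["手机 不错 不错", "手机 发热", "手机 退货 发热"]
weights = tfidf_weights(docs)
for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

A term that occurs in every comment, such as "手机" here, gets weight 0 under this form, which is the same intuition behind the stop-word list.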





The SVM models are taken directly from the sklearn library. Three variants are compared: SVC, LinearSVC, and SGDClassifier.
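
The exact SVM formulation the article has in mind is not reproduced here. As a rough guide, a linear soft-margin SVM minimizes a regularization term plus the hinge loss max(0, 1 - y(w·x + b)). The sketch below trains such a model with plain subgradient descent on made-up 2-D points. The learning rate, regularization strength, epoch count, and function names are assumptions for illustration, and the code does not reproduce sklearn's solvers.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=100):
    """Subgradient descent on L2-regularized hinge loss (toy sketch)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            # The regularization term shrinks w on every step
            w = [w[0] * (1 - lr * lam), w[1] * (1 - lr * lam)]
            if margin < 1:
                # The point violates the margin: push the boundary away from it
                w = [w[0] + lr * y * x1, w[1] + lr * y * x2]
                b += lr * y
    return w, b


def predict(w, b, point):
    """Return 1 or -1 depending on the side of the separating line."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score >= 0 else -1


# Made-up, linearly separable toy data
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, p) for p in points])
```

In the review task the inputs are the high-dimensional tf-idf vectors rather than 2-D points, but the decision rule, the sign of w·x + b, is the same.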





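For convenience, we create a text sentiment analysis class, CommentClassifier, to carry out the modeling:

  • __init__ is the initializer. Its parameters classifier_type and vector_type select the classification model and the vectorization method.
  • fit() performs vectorization and model training.

The code is as follows: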

    # Vectorizers
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    # Three SVM-style classifiers
    from sklearn.svm import SVC
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import SGDClassifier
    # Train/test splitting and cross-validation
    # (these live in sklearn.model_selection; sklearn.cross_validation was removed)
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import cross_val_score
    # Evaluation metrics
    from sklearn import metrics

    # Text sentiment classifier: CommentClassifier
    class CommentClassifier:
        def __init__(self, classifier_type, vector_type):
            # Classifier: 1 - SVC, 2 - LinearSVC, 3 - SGDClassifier
            self.classifier_type = classifier_type
            # Vectorization: 0 - term counts, 1 - TF, 2 - TF-IDF
            self.vector_type = vector_type

        def fit(self, train_x, train_y, max_df):
            list_text = list(train_x)
            # max_df must be passed by keyword: the first parameter of the
            # vectorizers is `input`, not `max_df`
            if self.vector_type == 0:
                self.vectorizer = CountVectorizer(
                    max_df=max_df, stop_words=stopwords,
                    ngram_range=(1, 3)).fit(list_text)
            elif self.vector_type == 1:
                self.vectorizer = TfidfVectorizer(
                    max_df=max_df, stop_words=stopwords,
                    ngram_range=(1, 3), use_idf=False).fit(list_text)
            else:
                self.vectorizer = TfidfVectorizer(
                    max_df=max_df, stop_words=stopwords,
                    ngram_range=(1, 3)).fit(list_text)
            self.array_trainx = self.vectorizer.transform(list_text)
            self.array_trainy = train_y
            # Model selection: 1 - SVC, 2 - LinearSVC, 3 - SGDClassifier
            if self.classifier_type == 1:
                self.model = SVC(kernel='linear', gamma=10 ** -5, C=1).fit(
                    self.array_trainx, self.array_trainy)
            elif self.classifier_type == 2:
                self.model = LinearSVC().fit(self.array_trainx, self.array_trainy)
            else:
                self.model = SGDClassifier().fit(self.array_trainx, self.array_trainy)

        def predict_value(self, test_x):
            list_text = list(test_x)
            self.array_testx = self.vectorizer.transform(list_text)
            array_predict = self.model.predict(self.array_testx)
            return array_predict

        def predict_proba(self, test_x):
            # Works only if the model exposes predict_proba; the three
            # models as configured above do not
            list_text = list(test_x)
            self.array_testx = self.vectorizer.transform(list_text)
            array_score = self.model.predict_proba(self.array_testx)
            return array_score
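
  • Use train_test_split() to split the data into a training set (80%) and a test set (20%).
  • Build value lists for the two parameters classifier_type and vector_type, which select the classification model and the vectorization method.
  • For every combination of vectorization method and classification model, print the evaluation results: the confusion matrix and a score table with Precision, Recall, and F1-score.

The code is as follows: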

    # Split into training and test sets
    train_x, test_x, train_y, test_y = train_test_split(
        data_cutted['Comment'].values.astype('U'),
        data_cutted['Class'].values,
        test_size=0.2, random_state=4)
    classifier_list = [1, 2, 3]
    vector_list = [0, 1, 2]
    for classifier_type in classifier_list:
        for vector_type in vector_list:
            commentCls = CommentClassifier(classifier_type, vector_type)
            # max_df is set to 0.98
            commentCls.fit(train_x, train_y, 0.98)
            value_result = commentCls.predict_value(test_x)
            print(classifier_type, vector_type)
            print('classification report')
            print(metrics.classification_report(
                test_y, value_result, labels=[-1, 0, 1]))
            print('confusion matrix')
            print(metrics.confusion_matrix(
                test_y, value_result, labels=[-1, 0, 1]))

The output is shown below. Each block starts with the classifier_type and vector_type pair:

    1 0
    classification report
                 precision  recall  f1-score  support
             -1       0.68    0.62      0.65      519
              0       0.55    0.49      0.52      485
              1       0.75    0.86      0.80      634
    avg / total       0.67    0.68      0.67     1638
    confusion matrix
    [[324 130  65]
     [131 236 118]
     [ 24  64 546]]
    1 1
    classification report
                 precision  recall  f1-score  support
             -1       0.71    0.74      0.72      519
              0       0.58    0.54      0.56      485
              1       0.84    0.85      0.85      634
    avg / total       0.72    0.72      0.72     1638
    confusion matrix
    [[385 109  25]
     [145 263  77]
     [ 15  80 539]]
    1 2
    classification report
                 precision  recall  f1-score  support
             -1       0.70    0.74      0.72      519
              0       0.58    0.52      0.55      485
              1       0.84    0.86      0.85      634
    avg / total       0.72    0.72      0.72     1638
    confusion matrix
    [[386 106  27]
     [151 254  80]
     [ 14  76 544]]
    2 0
    classification report
                 precision  recall  f1-score  support
             -1       0.70    0.62      0.66      519
              0       0.56    0.51      0.54      485
              1       0.76    0.88      0.82      634
    avg / total       0.68    0.69      0.68     1638
    confusion matrix
    [[320 135  64]
     [122 248 115]
     [ 16  57 561]]
    2 1
    classification report
                 precision  recall  f1-score  support
             -1       0.69    0.73      0.71      519
              0       0.61    0.48      0.54      485
              1       0.81    0.91      0.86      634
    avg / total       0.71    0.73      0.72     1638
    confusion matrix
    [[377 108  34]
     [154 233  98]
     [ 12  44 578]]
    2 2
    classification report
                 precision  recall  f1-score  support
             -1       0.70    0.74      0.72      519
              0       0.61    0.50      0.55      485
              1       0.83    0.91      0.87      634
    avg / total       0.72    0.73      0.73     1638
    confusion matrix
    [[383 108  28]
     [154 241  90]
     [ 13  43 578]]
    3 0
    classification report
                 precision  recall  f1-score  support
             -1       0.69    0.69      0.69      519
              0       0.58    0.47      0.52      485
              1       0.79    0.90      0.84      634
    avg / total       0.70    0.71      0.70     1638
    confusion matrix
    [[359 118  42]
     [148 228 109]
     [ 14  47 573]]
    3 1
    classification report
                 precision  recall  f1-score  support
             -1       0.70    0.74      0.72      519
              0       0.60    0.49      0.54      485
              1       0.81    0.88      0.84      634
    avg / total       0.71    0.72      0.71     1638
    confusion matrix
    [[386  96  37]
     [152 240  93]
     [ 13  66 555]]
    3 2
    classification report
                 precision  recall  f1-score  support
             -1       0.65    0.75      0.69      519
              0       0.63    0.49      0.55      485
              1       0.83    0.86      0.85      634
    avg / total       0.71    0.72      0.71     1638
    confusion matrix
    [[389  98  32]
     [169 236  80]
     [ 45  41 548]]

The results show that tf-idf vectorization combined with the LinearSVC model works best, with an f1-score of 0.73.
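
The report figures follow directly from the confusion matrix. As a check, this small sketch recomputes precision, recall, and F1 from the matrix printed above for classifier_type 2 (LinearSVC) and vector_type 2 (tf-idf). It assumes the sklearn convention that rows are true labels and columns are predicted labels; the function name per_class_metrics is chosen for illustration.

```python
def per_class_metrics(cm, labels):
    """Compute precision, recall, F1 and support from a confusion matrix.

    cm[i][j] is the number of samples whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    result = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        predicted = sum(row[k] for row in cm)  # column sum
        actual = sum(cm[k])                    # row sum (support)
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        result[label] = (precision, recall, f1, actual)
    return result


# Confusion matrix for classifier_type=2, vector_type=2
cm = [[383, 108, 28],
      [154, 241, 90],
      [13, 43, 578]]
scores = per_class_metrics(cm, [-1, 0, 1])
total = sum(s[3] for s in scores.values())
# Support-weighted average, as in the "avg / total" row
weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total
for label, (p, r, f1, n) in scores.items():
    print(label, round(p, 2), round(r, 2), round(f1, 2), n)
print("weighted f1", round(weighted_f1, 2))
```

The low recall of the neutral class (about half of the neutral reviews are recovered) is what drags the average down.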

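The confusion matrices show that most misclassifications involve the neutral and negative reviews. We can therefore remove the neutral reviews from the dataset:

    # Keep only positive and negative reviews; copy() lets columns be added later
    data_bi = data_cutted[data_cutted['Class'] != 0].copy()
    print(data_bi['Class'].value_counts())

The output is: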

     1    3042
    -1    2658
    Name: Class, dtype: int64

Running the classification models again gives the following results:

    1 0
    classification report
                 precision  recall  f1-score  support
             -1       0.90    0.79      0.84      537
              1       0.83    0.92      0.87      603
    avg / total       0.86    0.86      0.86     1140
    confusion matrix
    [[425 112]
     [ 48 555]]
    1 1
    classification report
                 precision  recall  f1-score  support
             -1       0.87    0.92      0.90      537
              1       0.93    0.88      0.90      603
    avg / total       0.90    0.90      0.90     1140
    confusion matrix
    [[496  41]
     [ 71 532]]
    1 2
    classification report
                 precision  recall  f1-score  support
             -1       0.88    0.93      0.90      537
              1       0.93    0.88      0.91      603
    avg / total       0.90    0.90      0.90     1140
    confusion matrix
    [[497  40]
     [ 70 533]]
    2 0
    classification report
                 precision  recall  f1-score  support
             -1       0.90    0.80      0.85      537
              1       0.84    0.92      0.88      603
    avg / total       0.87    0.86      0.86     1140
    confusion matrix
    [[431 106]
     [ 48 555]]
    2 1
    classification report
                 precision  recall  f1-score  support
             -1       0.92    0.91      0.91      537
              1       0.92    0.93      0.92      603
    avg / total       0.92    0.92      0.92     1140
    confusion matrix
    [[486  51]
     [ 43 560]]
    2 2
    classification report
                 precision  recall  f1-score  support
             -1       0.93    0.91      0.92      537
              1       0.92    0.94      0.93      603
    avg / total       0.92    0.92      0.92     1140
    confusion matrix
    [[488  49]
     [ 39 564]]
    3 0
    classification report
                 precision  recall  f1-score  support
             -1       0.92    0.82      0.87      537
              1       0.86    0.94      0.90      603
    avg / total       0.89    0.88      0.88     1140
    confusion matrix
    [[443  94]
     [ 38 565]]
    3 1
    classification report
                 precision  recall  f1-score  support
             -1       0.92    0.91      0.91      537
              1       0.92    0.93      0.92      603
    avg / total       0.92    0.92      0.92     1140
    confusion matrix
    [[486  51]
     [ 41 562]]
    3 2
    classification report
                 precision  recall  f1-score  support
             -1       0.88    0.93      0.90      537
              1       0.93    0.89      0.91      603
    avg / total       0.91    0.91      0.91     1140
    confusion matrix
    [[497  40]
     [ 67 536]]

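After the neutral reviews are removed, every combination of vectorizer and classifier improves markedly. This also shows that the classifiers can effectively pick out the positive reviews.

The dataset contains inaccurate labels, mostly among the neutral reviews. People usually give a positive rating unless something is wrong, so a neutral rating signals some dissatisfaction and its wording leans toward negative sentiment. Reviews are also highly subjective: many reviews that the dataset labels as neutral would be judged negative by a reader. Sorting reviews into positive, neutral, and negative is therefore not fully objective, and the blurry boundary between neutral and negative makes it hard to raise the recognition rate.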

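5. Unsupervised Classification Model Based on doc2vec (from word2vec)

The open-source text vectorization tool word2vec can find deeper feature representations for text data. Its word vectors support arithmetic, for example:

w2v(woman) - w2v(man) + w2v(king) ≈ w2v(queen)

doc2vec, which builds on word2vec, represents each document as a vector, and the cosine similarity of two vectors measures how similar the two documents are. We can therefore compute how close a review is to an extremely positive review and how close it is to an extremely negative one.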

In the dataset of this case study:

  • Positive review: 快 就是 手感 满意 也好 喜欢 也 流畅 很 服务态度 实用 超快 挺快 用着 速度 礼品 也不错 非常好 挺好 感觉 才来 还行 好看 也快 不错的 送了 非常不错 超级 赞 好多东西 很实用 各方面 挺好的 很多 漂亮 配件 还不错 也多 特意 慢 满分 好用 非常漂亮......
  • Negative review: 不多说 上当 差差 刚用 服务差 一点也不 不要 简直 还是去 实体店 大家 保证 不肯 生气 开发票 磨损 后悔 印记 网 什么破 烂烂 左边 失效 太 骗 掉价 走下坡路 不说了 彻底 三星手机 自营 几次 真心 别的 看完 简单说 机会 这是 生气了 触动 缝隙 冲动了 失望......
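
The similarity computation itself is simple. The sketch below computes cosine similarity in plain Python on made-up 3-dimensional vectors; real document vectors come from a trained doc2vec model, and the names and numbers here are only illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up "document vectors" standing in for doc2vec output
extreme_pos = [0.9, 0.1, 0.2]
extreme_neg = [-0.8, 0.3, 0.1]
review = [0.7, 0.2, 0.1]

pos_sim = cosine_similarity(extreme_pos, review)
neg_sim = cosine_similarity(extreme_neg, review)
print(round(pos_sim, 3), round(neg_sim, 3))
```

A review whose vector points in nearly the same direction as the extremely positive one gets a similarity close to 1, and one pointing the opposite way gets a negative value.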

We use the third-party library gensim to build the doc2vec model.

The code is as follows:

    import pandas as pd
    from gensim.models import Doc2Vec
    from gensim.models.doc2vec import TaggedDocument
    import logging
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)
    train_x = data_bi['Comment'].values
    train_y = data_bi['Class'].values

    # Tag every review in train_x with a label of the form "TRAIN_<i>"
    def labelizeReviews(reviews, label_type):
        labelized = []
        for i, v in enumerate(reviews):
            label = '%s_%s' % (label_type, i)
            labelized.append(TaggedDocument(v.split(" "), [label]))
        return labelized

    train_x = labelizeReviews(train_x, "TRAIN")
    # Build the Doc2Vec model
    # (gensim 4 parameter names: vector_size and epochs, formerly size and iter)
    size = 300
    all_data = []
    all_data.extend(train_x)
    model = Doc2Vec(min_count=1, window=8, vector_size=size, sample=1e-4,
                    negative=5, hs=0, epochs=5, workers=8)
    model.build_vocab(all_data)
    # Run 10 training rounds; train() requires total_examples and epochs
    for epoch in range(10):
        model.train(train_x, total_examples=model.corpus_count,
                    epochs=model.epochs)
    # pos holds the cosine similarity between each review and the extremely
    # positive review (TRAIN_0); neg holds the similarity between each review
    # and the extremely negative review (TRAIN_1)
    pos = []
    neg = []
    for i in range(0, len(train_x)):
        pos.append(model.dv.similarity("TRAIN_0", "TRAIN_{}".format(i)))
        neg.append(model.dv.similarity("TRAIN_1", "TRAIN_{}".format(i)))
    # Store the two lists in the data as the fields PosSim and NegSim
    data_bi = data_bi.assign(PosSim=pos, NegSim=neg)

The training log is shown below:

  1. 2017-05-27 14:30:28,393 : INFO : collecting all words and their counts
  2. 2017-05-27 14:30:28,394 : INFO : PROGRESS: at example #0, processed 0 words (0/s),
  3. 0 word types, 0 tags
  4. 2017-05-27 14:30:28,593 : INFO : collected 10545 word types and 5700 unique tags
  5. from a corpus of 5700 examples and 482148 words
  6. 2017-05-27 14:30:28,595 : INFO : Loading a fresh vocabulary
  7. 2017-05-27 14:30:28,649 : INFO : min_count=1 retains 10545 unique words
  8. (100% of original 10545, drops 0)
  9. 2017-05-27 14:30:28,650 : INFO : min_count=1 leaves 482148 word corpus
  10. (100% of original 482148, drops 0)
  11. 2017-05-27 14:30:28,705 : INFO : deleting the raw counts dictionary of 10545 items
  12. 2017-05-27 14:30:28,706 : INFO : sample=0.0001 downsamples 217 most-common words
  13. 2017-05-27 14:30:28,707 : INFO : downsampling leaves estimated 108356 word corpus
  14. (22.5% of prior 482148)
  15. 2017-05-27 14:30:28,709 : INFO : estimated required memory for 10545 words and
  16. 300 dimensions: 38560500 bytes
  17. 2017-05-27 14:30:28,784 : INFO : resetting layer weights
  18. 2017-05-27 14:30:29,120 : INFO : training model with 8 workers on 10545 vocabulary
  19. and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  20. 2017-05-27 14:30:29,121 : INFO : expecting 5700 sentences, matching count from
  21. corpus used for vocabulary survey
  22. 2017-05-27 14:30:30,176 : INFO : PROGRESS: at 10.24% examples, 72316 words/s,
  23. in_qsize 15, out_qsize 0
  24. 2017-05-27 14:30:31,211 : INFO : PROGRESS: at 29.96% examples, 91057 words/s,
  25. in_qsize 16, out_qsize 0
  26. 2017-05-27 14:30:32,218 : INFO : PROGRESS: at 66.30% examples, 126742 words/s,
  27. in_qsize 15, out_qsize 0
  28. 2017-05-27 14:30:33,231 : INFO : PROGRESS: at 86.00% examples, 122698 words/s,
  29. in_qsize 15, out_qsize 0
  30. 2017-05-27 14:30:33,571 : INFO : worker thread finished; awaiting finish of 7
  31. more threads
  32. 2017-05-27 14:30:33,573 : INFO : worker thread finished; awaiting finish of 6
  33. more threads
  34. 2017-05-27 14:30:33,605 : INFO : worker thread finished; awaiting finish of 5
  35. more threads
  36. 2017-05-27 14:30:33,647 : INFO : worker thread finished; awaiting finish of 4
  37. more threads
  38. 2017-05-27 14:30:33,678 : INFO : worker thread finished; awaiting finish of 3
  39. more threads
  40. 2017-05-27 14:30:33,696 : INFO : worker thread finished; awaiting finish of 2
  41. more threads
  42. 2017-05-27 14:30:33,711 : INFO : worker thread finished; awaiting finish of 1
  43. more threads
  44. 2017-05-27 14:30:33,722 : INFO : worker thread finished; awaiting finish of 0
  45. more threads
  46. 2017-05-27 14:30:33,724 : INFO : training on 2410740 raw words (570332 effective words)
  47. took 4.6s, 124032 effective words/s
  48. 2017-05-27 14:30:33,727 : INFO : training model with 8 workers on 10545 vocabulary and
  49. 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  50. 2017-05-27 14:30:33,731 : INFO : expecting 5700 sentences, matching count from corpus
  51. used for vocabulary survey
  52. 2017-05-27 14:30:34,753 : INFO : PROGRESS: at 36.38% examples, 212225 words/s,
  53. in_qsize 15, out_qsize 0
  54. 2017-05-27 14:30:35,762 : INFO : PROGRESS: at 75.24% examples, 216859 words/s,
  55. in_qsize 16, out_qsize 0
  56. 2017-05-27 14:30:36,243 : INFO : worker thread finished; awaiting finish of 7
  57. more threads
  58. 2017-05-27 14:30:36,244 : INFO : worker thread finished; awaiting finish of 6
  59. more threads
  60. 2017-05-27 14:30:36,264 : INFO : worker thread finished; awaiting finish of 5
  61. more threads
  62. 2017-05-27 14:30:36,306 : INFO : worker thread finished; awaiting finish of 4
  63. more threads
  64. 2017-05-27 14:30:36,311 : INFO : worker thread finished; awaiting finish of 3
  65. more threads
  66. 2017-05-27 14:30:36,320 : INFO : worker thread finished; awaiting finish of 2
  67. more threads
  68. 2017-05-27 14:30:36,330 : INFO : worker thread finished; awaiting finish of 1
  69. more threads
  70. 2017-05-27 14:30:36,336 : INFO : worker thread finished; awaiting finish of 0
  71. more threads
  72. 2017-05-27 14:30:36,338 : INFO : training on 2410740 raw words (570008 effective words)
  73. took 2.6s, 219523 effective words/s
  74. 2017-05-27 14:30:36,339 : INFO : training model with 8 workers on 10545 vocabulary
  75. and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  76. 2017-05-27 14:30:36,341 : INFO : expecting 5700 sentences, matching count from
  77. corpus used for vocabulary survey
  78. 2017-05-27 14:30:37,353 : INFO : PROGRESS: at 28.23% examples, 177496 words/s,
  79. in_qsize 16, out_qsize 0
  80. 2017-05-27 14:30:38,372 : INFO : PROGRESS: at 66.30% examples, 193880 words/s,
  81. in_qsize 16, out_qsize 0
  82. 2017-05-27 14:30:39,061 : INFO : worker thread finished; awaiting finish of 7
  83. more threads
  84. 2017-05-27 14:30:39,062 : INFO : worker thread finished; awaiting finish of 6
  85. more threads
  86. 2017-05-27 14:30:39,074 : INFO : worker thread finished; awaiting finish of 5
  87. more threads
  88. 2017-05-27 14:30:39,115 : INFO : worker thread finished; awaiting finish of 4
  89. more threads
  90. 2017-05-27 14:30:39,122 : INFO : worker thread finished; awaiting finish of 3
  91. more threads
  92. 2017-05-27 14:30:39,132 : INFO : worker thread finished; awaiting finish of 2
  93. more threads
  94. 2017-05-27 14:30:39,147 : INFO : worker thread finished; awaiting finish of 1
  95. more threads
  96. 2017-05-27 14:30:39,154 : INFO : worker thread finished; awaiting finish of 0
  97. more threads
  98. 2017-05-27 14:30:39,155 : INFO : training on 2410740 raw words (570746 effective words)
  99. took 2.8s, 203312 effective words/s
  100. 2017-05-27 14:30:39,158 : INFO : training model with 8 workers on 10545 vocabulary
  101. and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  102. 2017-05-27 14:30:39,159 : INFO : expecting 5700 sentences, matching count from corpus
  103. used for vocabulary survey
  104. 2017-05-27 14:30:40,168 : INFO : PROGRESS: at 37.74% examples, 222816 words/s,
  105. in_qsize 16, out_qsize 0
  106. 2017-05-27 14:30:41,177 : INFO : PROGRESS: at 77.55% examples, 223202 words/s,
  107. in_qsize 16, out_qsize 0
  108. 2017-05-27 14:30:41,605 : INFO : worker thread finished; awaiting finish of 7
  109. more threads
  110. 2017-05-27 14:30:41,610 : INFO : worker thread finished; awaiting finish of 6
  111. more threads
  112. 2017-05-27 14:30:41,614 : INFO : worker thread finished; awaiting finish of 5
  113. more threads
  114. 2017-05-27 14:30:41,645 : INFO : worker thread finished; awaiting finish of 4
  115. more threads
  116. 2017-05-27 14:30:41,670 : INFO : worker thread finished; awaiting finish of 3
  117. more threads
  118. 2017-05-27 14:30:41,674 : INFO : worker thread finished; awaiting finish of 2
  119. more threads
  120. 2017-05-27 14:30:41,682 : INFO : worker thread finished; awaiting finish of 1
  121. more threads
  122. 2017-05-27 14:30:41,690 : INFO : worker thread finished; awaiting finish of 0
  123. more threads
  124. 2017-05-27 14:30:41,692 : INFO : training on 2410740 raw words (569889 effective words)
  125. took 2.5s, 225457 effective words/s
  126. 2017-05-27 14:30:41,694 : INFO : training model with 8 workers on 10545 vocabulary
  127. and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  128. 2017-05-27 14:30:41,696 : INFO : expecting 5700 sentences, matching count from corpus
  129. used for vocabulary survey
  130. 2017-05-27 14:30:42,712 : INFO : PROGRESS: at 29.16% examples, 183182 words/s,
  131. in_qsize 15, out_qsize 0
  132. 2017-05-27 14:30:43,754 : INFO : PROGRESS: at 69.96% examples, 203560 words/s,
  133. in_qsize 15, out_qsize 0
  134. 2017-05-27 14:30:44,804 : INFO : PROGRESS: at 91.97% examples, 173787 words/s,
  135. in_qsize 14, out_qsize 0
  136. 2017-05-27 14:30:44,973 : INFO : worker thread finished; awaiting finish of 7
  137. more threads
  138. 2017-05-27 14:30:44,989 : INFO : worker thread finished; awaiting finish of 6
  139. more threads
  140. 2017-05-27 14:30:45,028 : INFO : worker thread finished; awaiting finish of 5
  141. more threads
  142. 2017-05-27 14:30:45,061 : INFO : worker thread finished; awaiting finish of 4
  143. more threads
  144. 2017-05-27 14:30:45,097 : INFO : worker thread finished; awaiting finish of 3
  145. more threads
  146. 2017-05-27 14:30:45,101 : INFO : worker thread finished; awaiting finish of 2
  147. more threads
  148. 2017-05-27 14:30:45,121 : INFO : worker thread finished; awaiting finish of 1
  149. more threads
  150. 2017-05-27 14:30:45,125 : INFO : worker thread finished; awaiting finish of 0
  151. more threads
  152. 2017-05-27 14:30:45,128 : INFO : training on 2410740 raw words (569903 effective words)
  153. took 3.4s, 166370 effective words/s
  154. 2017-05-27 14:30:45,131 : INFO : training model with 8 workers on 10545 vocabulary and
  155. 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  156. 2017-05-27 14:30:45,132 : INFO : expecting 5700 sentences, matching count from corpus
  157. used for vocabulary survey
  158. 2017-05-27 14:30:46,152 : INFO : PROGRESS: at 11.26% examples, 79348 words/s,
  159. in_qsize 16, out_qsize 0
  160. 2017-05-27 14:30:47,153 : INFO : PROGRESS: at 27.52% examples, 85992 words/s,
  161. in_qsize 16, out_qsize 0
  162. 2017-05-27 14:30:48,166 : INFO : PROGRESS: at 66.47% examples, 130273 words/s,
  163. in_qsize 15, out_qsize 0
  164. 2017-05-27 14:30:49,061 : INFO : worker thread finished; awaiting finish of 7
  165. more threads
  166. 2017-05-27 14:30:49,076 : INFO : worker thread finished; awaiting finish of 6
  167. more threads
  168. 2017-05-27 14:30:49,088 : INFO : worker thread finished; awaiting finish of 5
  169. more threads
  170. 2017-05-27 14:30:49,123 : INFO : worker thread finished; awaiting finish of 4
  171. more threads
  172. 2017-05-27 14:30:49,144 : INFO : worker thread finished; awaiting finish of 3
  173. more threads
  174. 2017-05-27 14:30:49,147 : INFO : worker thread finished; awaiting finish of 2
  175. more threads
  176. 2017-05-27 14:30:49,152 : INFO : worker thread finished; awaiting finish of 1
  177. more threads
  178. 2017-05-27 14:30:49,159 : INFO : worker thread finished; awaiting finish of 0
  179. more threads
  180. 2017-05-27 14:30:49,160 : INFO : training on 2410740 raw words (570333 effective words)
  181. took 4.0s, 141860 effective words/s
  182. 2017-05-27 14:30:49,161 : INFO : training model with 8 workers on 10545 vocabulary and
  183. 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  184. 2017-05-27 14:30:49,163 : INFO : expecting 5700 sentences, matching count from corpus
  185. used for vocabulary survey
  186. 2017-05-27 14:30:50,185 : INFO : PROGRESS: at 31.78% examples, 193530 words/s,
  187. in_qsize 15, out_qsize 0
  188. 2017-05-27 14:30:51,244 : INFO : PROGRESS: at 48.51% examples, 141817 words/s,
  189. in_qsize 15, out_qsize 0
  190. 2017-05-27 14:30:52,278 : INFO : PROGRESS: at 69.96% examples, 134399 words/s,
  191. in_qsize 16, out_qsize 0
  192. 2017-05-27 14:30:52,918 : INFO : worker thread finished; awaiting finish of 7
  193. more threads
  194. 2017-05-27 14:30:52,936 : INFO : worker thread finished; awaiting finish of 6
  195. more threads
  196. 2017-05-27 14:30:52,945 : INFO : worker thread finished; awaiting finish of 5
  197. more threads
  198. 2017-05-27 14:30:52,976 : INFO : worker thread finished; awaiting finish of 4
  199. more threads
  200. 2017-05-27 14:30:52,979 : INFO : worker thread finished; awaiting finish of 3
  201. more threads
  202. 2017-05-27 14:30:52,984 : INFO : worker thread finished; awaiting finish of 2
  203. more threads
  204. 2017-05-27 14:30:52,995 : INFO : worker thread finished; awaiting finish of 1
  205. more threads
  206. 2017-05-27 14:30:52,998 : INFO : worker thread finished; awaiting finish of 0
  207. more threads
  208. 2017-05-27 14:30:52,999 : INFO : training on 2410740 raw words (570031 effective words)
  209. took 3.8s, 148864 effective words/s
  210. 2017-05-27 14:30:53,000 : INFO : training model with 8 workers on 10545 vocabulary
  211. and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  212. 2017-05-27 14:30:53,002 : INFO : expecting 5700 sentences, matching count from
  213. corpus used for vocabulary survey
  214. 2017-05-27 14:30:54,024 : INFO : PROGRESS: at 34.48% examples, 202424 words/s,
  215. in_qsize 15, out_qsize 0
  216. 2017-05-27 14:30:55,035 : INFO : PROGRESS: at 68.58% examples, 201499 words/s,
  217. in_qsize 15, out_qsize 0
  218. 2017-05-27 14:30:56,010 : INFO : worker thread finished; awaiting finish of 7
  219. more threads
  220. 2017-05-27 14:30:56,017 : INFO : worker thread finished; awaiting finish of 6
  221. more threads
  222. 2017-05-27 14:30:56,048 : INFO : PROGRESS: at 96.89% examples, 183861 words/s,
  223. in_qsize 5, out_qsize 1
  224. 2017-05-27 14:30:56,049 : INFO : worker thread finished; awaiting finish of 5
  225. more threads
  226. 2017-05-27 14:30:56,071 : INFO : worker thread finished; awaiting finish of 4
  227. more threads
  228. 2017-05-27 14:30:56,084 : INFO : worker thread finished; awaiting finish of 3
  229. more threads
  230. 2017-05-27 14:30:56,099 : INFO : worker thread finished; awaiting finish of 2
  231. more threads
  232. 2017-05-27 14:30:56,101 : INFO : worker thread finished; awaiting finish of 1
  233. more threads
  234. 2017-05-27 14:30:56,104 : INFO : worker thread finished; awaiting finish of 0
  235. more threads
  236. 2017-05-27 14:30:56,104 : INFO : training on 2410740 raw words (570328 effective words)
  237. took 3.1s, 184129 effective words/s
  238. 2017-05-27 14:30:56,105 : INFO : training model with 8 workers on 10545 vocabulary and
  239. 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  240. 2017-05-27 14:30:56,107 : INFO : expecting 5700 sentences, matching count from corpus
  241. used for vocabulary survey
  242. 2017-05-27 14:30:57,134 : INFO : PROGRESS: at 33.13% examples, 197730 words/s,
  243. in_qsize 15, out_qsize 0
  244. 2017-05-27 14:30:58,140 : INFO : PROGRESS: at 69.96% examples, 206423 words/s,
  245. in_qsize 15, out_qsize 0
  246. 2017-05-27 14:30:58,876 : INFO : worker thread finished; awaiting finish of 7
  247. more threads
  248. 2017-05-27 14:30:58,883 : INFO : worker thread finished; awaiting finish of 6
  249. more threads
  250. 2017-05-27 14:30:58,889 : INFO : worker thread finished; awaiting finish of 5
  251. more threads
  252. 2017-05-27 14:30:58,937 : INFO : worker thread finished; awaiting finish of 4
  253. more threads
  254. 2017-05-27 14:30:58,949 : INFO : worker thread finished; awaiting finish of 3
  255. more threads
  256. 2017-05-27 14:30:58,953 : INFO : worker thread finished; awaiting finish of 2
  257. more threads
  258. 2017-05-27 14:30:58,960 : INFO : worker thread finished; awaiting finish of 1
  259. more threads
  260. 2017-05-27 14:30:58,967 : INFO : worker thread finished; awaiting finish of 0
  261. more threads
  262. 2017-05-27 14:30:58,968 : INFO : training on 2410740 raw words (570312 effective words)
  263. took 2.9s, 199922 effective words/s
  264. 2017-05-27 14:30:58,969 : INFO : training model with 8 workers on 10545 vocabulary and
  265. 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=8
  266. 2017-05-27 14:30:58,970 : INFO : expecting 5700 sentences, matching count from corpus
  267. used for vocabulary survey
  268. 2017-05-27 14:30:59,991 : INFO : PROGRESS: at 32.86% examples, 198045 words/s,
  269. in_qsize 16, out_qsize 0
  270. 2017-05-27 14:31:00,993 : INFO : PROGRESS: at 68.23% examples, 201443 words/s,
  271. in_qsize 16, out_qsize 0
  272. 2017-05-27 14:31:01,881 : INFO : worker thread finished; awaiting finish of 7
  273. more threads
  274. 2017-05-27 14:31:01,888 : INFO : worker thread finished; awaiting finish of 6
  275. more threads
  276. 2017-05-27 14:31:01,907 : INFO : worker thread finished; awaiting finish of 5
  277. more threads
  278. 2017-05-27 14:31:01,922 : INFO : worker thread finished; awaiting finish of 4
  279. more threads
  280. 2017-05-27 14:31:01,941 : INFO : worker thread finished; awaiting finish of 3
  281. more threads
  282. 2017-05-27 14:31:01,948 : INFO : worker thread finished; awaiting finish of 2
  283. more threads
  284. 2017-05-27 14:31:01,955 : INFO : worker thread finished; awaiting finish of 1
  285. more threads
  286. 2017-05-27 14:31:01,961 : INFO : worker thread finished; awaiting finish of 0
  287. more threads
  288. 2017-05-27 14:31:01,962 : INFO : training on 2410740 raw words (570826 effective words)
  289. took 3.0s, 191072 effective words/s

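Finally, we visualize how well the reviews are separated:

    from matplotlib import pyplot as plt
    label = data_bi['Class'].values
    values = data_bi[['PosSim', 'NegSim']].values
    # The 'bwr' colormap draws negative reviews (-1) in blue and positive reviews (1) in red
    plt.scatter(values[:, 0], values[:, 1], c=label, cmap='bwr', alpha=0.4)
    plt.show()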

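The result is shown below:

In the plot, positive and negative reviews can largely be separated by a straight line (blue: negative reviews; red: positive reviews).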
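
One simple way to turn the two similarities into a prediction is to assume the separating line is the diagonal PosSim = NegSim and to call a review positive when it is more similar to the extremely positive review than to the extremely negative one. The article does not say which line it has in mind, so this rule, the function name, and the sample values below are illustrative assumptions.

```python
def classify_by_similarity(pos_sim, neg_sim):
    """Return 1 (positive) if PosSim exceeds NegSim, otherwise -1."""
    return 1 if pos_sim > neg_sim else -1


# Made-up (PosSim, NegSim, true label) triples for illustration
samples = [
    (0.62, 0.15, 1),
    (0.48, 0.31, 1),
    (0.10, 0.55, -1),
    (0.33, 0.41, -1),
]
predictions = [classify_by_similarity(p, n) for p, n, _ in samples]
correct = sum(pred == y for pred, (_, _, y) in zip(predictions, samples))
accuracy = correct / len(samples)
print(predictions, accuracy)
```

On real data the best line need not be the diagonal; fitting a linear classifier on the two columns would locate it, at the cost of needing labels.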

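This approach is entirely different from the traditional one: it uses no term-frequency or sentiment-word features. Its advantages are:

  • It maps the dataset into a very low-dimensional space of only two dimensions.
  • It is an unsupervised learning method, so the original training data does not need to be labelled.
  • It is general. To use it in another domain, find an extremely positive and an extremely negative example in that domain, convert them and all the data to be classified into vectors with doc2vec, and compute the distances.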
