5.4 Examples of Structured Processing of Text Data

5.4.1 Structured Processing of English Text

This section uses news data from the 20 Newsgroups corpus (http://www.qwone.com/~jason/20Newsgroups/) to show how to build a document-term matrix, a term frequency-inverse document frequency (TF-IDF) matrix, and word vectors for English documents.

Load the news documents of the 'rec.sport.baseball' category of the 20 Newsgroups dataset with scikit-learn's built-in function fetch_20newsgroups:


In [1]:import sklearn
       from sklearn.datasets import fetch_20newsgroups
       dataset = fetch_20newsgroups(shuffle=True, random_state=1,remove=('headers', 'footers', 'quotes'),categories=['rec.sport.baseball'])
       dataset
Out[1]:{'DESCR':None,
        'data': [u"Hello All,\n\n …… ,\n\nMike",
                 u'\n',
                 ...],
        'description':'the 20 newsgroups by date dataset',
        'filenames':array(['……'], dtype='|S97'),
        'target':array([0, 0, 0, ..., 0, 0, 0]),
        'target_names':['rec.sport.baseball']}
In [2]:len(dataset.data) # number of documents
Out[2]:597
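
Before building any matrices, it can help to peek at one raw document. A small illustrative sketch (the index 0 and the 300-character cutoff are arbitrary choices):

print(dataset.data[0][:300]) # first 300 characters of the first news document
print(dataset.target[:10])   # category labels; all 0 because only one category was loaded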

Build the document-term matrix for dataset.data:


In [3]:from sklearn.feature_extraction.text import CountVectorizer
       dtm_vectorizer = CountVectorizer(stop_words="english")
       dtm = dtm_vectorizer.fit_transform(dataset.data).toarray()
       dtm
Out[3]:array([[0, 0, 0, ..., 0, 0, 0],
              [0, 0, 0, ..., 0, 0, 0],
              [0, 0, 0, ..., 0, 0, 0],
              ..., 
              [0, 0, 0, ..., 0, 0, 0],
              [0, 0, 0, ..., 0, 0, 0],
              [0, 0, 0, ..., 0, 0, 0]])
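
Each column of dtm corresponds to one term of the vocabulary learned by the CountVectorizer. A minimal inspection sketch; note that get_feature_names() was renamed to get_feature_names_out() in newer scikit-learn releases:

print(dtm.shape)                            # (number of documents, vocabulary size)
vocab = dtm_vectorizer.get_feature_names()  # terms, in the same order as the columns of dtm
print(vocab[:10])                           # first ten terms, in alphabetical order
print(dtm[0].sum())                         # total count of retained terms in the first document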

Build the TF-IDF matrix on top of the document-term matrix dtm:


In [4]:from sklearn.feature_extraction.text import TfidfTransformer
       transformer = TfidfTransformer()
       tfidf = transformer.fit_transform(dtm).toarray()
       tfidf
Out[4]:array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
              [ 0.,  0.,  0., ...,  0.,  0.,  0.],
              [ 0.,  0.,  0., ...,  0.,  0.,  0.],
              ..., 
              [ 0.,  0.,  0., ...,  0.,  0.,  0.],
              [ 0.,  0.,  0., ...,  0.,  0.,  0.],
              [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
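
With its default settings, TfidfTransformer computes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then L2-normalizes each row. A rough verification sketch under those assumed defaults:

import numpy as np
n = dtm.shape[0]                            # number of documents
df = (dtm > 0).sum(axis=0)                  # document frequency of each term
idf = np.log((1.0 + n) / (1.0 + df)) + 1.0  # smoothed idf (assumed scikit-learn default)
row = dtm[0] * idf                          # tf * idf for the first document
row = row / np.linalg.norm(row)             # L2 normalization of the row
print(np.allclose(row, tfidf[0]))           # True if the defaults are in effect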

Before training word vectors, tokenize the documents and filter out stop words and punctuation:


In [5]:import nltk
       from nltk.tokenize import word_tokenize
       from nltk.corpus import stopwords
       english_stopwords = stopwords.words("english")
       english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '!', '@', '#', '%', '$', '*'] # custom list of English punctuation marks
       words = [word_tokenize(t) for t in dataset.data]
       words_lower = [[j.lower() for j in i] for i in words] # convert to lower case
       words_clear = []
       for i in words_lower:
           words_filter = []
           for j in i:
               if j not in english_stopwords: # filter stop words
                   if j not in english_punctuations: # filter punctuation
                       words_filter.append(j)
           words_clear.append(words_filter)
       words_clear
Out[5]:[[u'hello',u"'d",u'like',……,u'use',u'appreciate',u'mike'],
        ……
        [u'someone',u'sabr',u'actually',……,u'4', u'5',u'runs']]

Train word vectors on the preprocessed token lists words_clear; size is the dimensionality of the vectors, window the size of the context window, and min_count the minimum frequency a word needs in order to be kept in the vocabulary:


In [6]:import gensim
       model = gensim.models.Word2Vec(words_clear, size=100, window=5, min_count=5)

List all words in the model's vocabulary:


In [7]:set(model.vocab.keys())
Out[7]:{u'inning',u'saves',u'foul',...}

Inspect a trained word vector:


In [8]:model['sorry']
Out[8]:array([-0.14678614, -0.12967731, -0.08349328, -0.12420464, -0.04189511,
              -0.03648274,  0.05199799,  0.05459175,  0.10753562,  0.00180915,
              ……
              -0.03970845,  0.00807121, -0.12356181, -0.20895423, -0.14045434,
               0.08667169, -0.17613558, -0.21308741, -0.048795  ,  0.03967994]
               , dtype=float32)
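
Beyond looking up a single vector, the model can also return the words closest to a query word in the embedding space. A short sketch using the same older gensim API as above (newer gensim versions expose the call as model.wv.most_similar):

model.most_similar('sorry', topn=5)  # five most similar words by cosine similarity
model.similarity('inning', 'foul')   # cosine similarity between two vocabulary words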

5.4.2 Structured Processing of Chinese Text

The Chinese example uses the same news text as Section 4.1.1.4, namely Sina social news downloaded from the Datatang website (http://more.datatang.com/).

First, load the data:


In [1]:with open("new.txt") as f:
           read = f.readlines()

Extract the first 500 news headlines and store them in the list title:


In [2]:title=[]
       for i in read[:500]:
           title.append(i.split("|")[1].decode("utf-8")) # after splitting on "|", the headline is at index [1]; decode converts the byte string to unicode

Tokenize the news headlines with the jieba Chinese word segmenter and filter out stop words:


In [3]:import jieba
       with open("stopwords.txt") as f: # open the stop-word dictionary
           read = f.read().decode('utf-8') # decode the string as UTF-8
       stop_words = read.splitlines()
       texts = []
       for i in title:
           title_seg = []
           segs = jieba.cut(i) # word segmentation
           for seg in segs:
               if seg not in stop_words: # filter stop words
                   title_seg.append(seg)
           texts.append(title_seg)
       texts
Out[3]:[[u'\u7f51\u6c11',u'\u8d28\u7591',……,u'\u6708',u'\u516c\u793a'],……]
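
The output above shows the tokens as unicode escape sequences; to read a segmented headline directly, one can join and print its tokens (assuming the console encoding can display Chinese):

print(" / ".join(texts[0])) # first segmented headline, tokens separated by " / "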

Build the document-term matrix with the gensim library (corpus2dense returns a terms-by-documents matrix, so the result is transposed below so that each row corresponds to one headline):


In [5]:import gensim
       from gensim import corpora
       dictionary = corpora.Dictionary(texts)
In [6]:word_count = [dictionary.doc2bow(text) for text in texts]
       word_count
Out[6]:[[(0, 1), (1, 1), (2, 1), ……, (6, 1), (7, 1), (8, 1)],...]
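
Each pair (id, count) in word_count is a term id from the dictionary together with the term's frequency in that headline. The mapping between ids and tokens can be inspected directly, for example:

print(dictionary[0])       # the token that was assigned id 0
print(dictionary.token2id) # full token -> id mapping
print(len(dictionary))     # vocabulary size, i.e. the number of rows of the dense matrix below
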
In [7]:from gensim.matutils import corpus2dense
       dtm_matrix=corpus2dense(word_count, len(dictionary))
       dtm_matrix.T
Out[7]:array([[ 1.,  1.,  1., ...,  0.,  0.,  0.],
              [ 0.,  0.,  0., ...,  0.,  0.,  0.],
              [ 0.,  0.,  0., ...,  0.,  0.,  0.],
              ..., 
              [ 0.,  0.,  0., ...,  0.,  0.,  0.],
              [ 0.,  0.,  0., ...,  1.,  1.,  0.],
              [ 0.,  0.,  0., ...,  0.,  0.,  1.]], dtype=float32)

Build the term frequency-inverse document frequency (TF-IDF) matrix with gensim:


In [8]:from gensim import models
       tfidf_model = models.TfidfModel(word_count) 
       tfidf = tfidf_model[word_count]
       tfidf_matrix=corpus2dense(tfidf, len(dictionary))
       tfidf_matrix.T
Out[8]:array([[0.3563464,0.18751191,0.24476767,...,0.        ,0.        ,  0.        ],
              [0.       ,0.        ,0.        ,...,0.        ,0.        ,0.        ],
              ..., 
              [0.        ,0.        ,0.        ,...,0.34610009,0.34610009,  0.        ],
              [0.        ,0.        ,0.        ,...,0.        ,0.        ,  0.50546187]], dtype=float32)

Train word vectors with gensim:


In [9]:model = gensim.models.Word2Vec(texts, size=100, window=5, min_count=5)

Inspect the training results for individual words:


In [10]:model[u'北京']
Out[10]:array([ -3.26092471e-03,   1.06515212e-03,   4.17973427e-03,
                 3.81700741e-03,   2.17494601e-03,   1.85929902e-03,
                ……
                -2.32837372e-03,  -1.31162710e-03,   4.11018310e-03,
                 2.64904671e-03], dtype=float32)

In [11]:model.similarity(u'台风',u'暴雨') # cosine similarity between the vectors for 台风 (typhoon) and 暴雨 (rainstorm)
Out[11]:0.13788290850313972
