Article from: https://www.cnblogs.com/U940634/p/9739462.html

Using sklearn package

CountVectorizer: its fit_transform method converts the words in the texts into a word-frequency matrix.

  • get_feature_names() returns the keywords (features) of all texts.
  • vocabulary_ maps each keyword to its column position in the matrix.
  • toarray() returns the word-frequency matrix as a dense array.

TfidfTransformer computes the TF-IDF weight of every word in the word-frequency matrix produced by CountVectorizer (a minimal sketch follows).
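
As a quick illustration, here is a minimal sketch of the two classes working together; the toy English corpus is a placeholder, not part of the original article:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the dog sat", "the cat ran"]   # placeholder corpus

countVectorizer = CountVectorizer()
counts = countVectorizer.fit_transform(docs)       # sparse word-frequency matrix

print(countVectorizer.get_feature_names())         # keywords of all texts (get_feature_names_out() in newer scikit-learn)
print(countVectorizer.vocabulary_)                  # keyword -> column index
print(counts.toarray())                             # dense word-frequency matrix

tfidf = TfidfTransformer().fit_transform(counts)    # TF-IDF weight of every word in the matrix
print(tfidf.toarray())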

 

TfidfVectorizer combines CountVectorizer and TfidfTransformer and generates the TF-IDF values directly.

Its key parameters are (a short sketch follows this list):

  • max_df: the maximum document frequency a term may have and still be kept as a feature in the TF-IDF matrix. If a term appears in, say, 80% of the documents, it probably carries very little information.
  • min_df: can be an integer, e.g. 5, meaning a term must appear in at least 5 documents to be considered; set to 0.2, a term must appear in at least 20% of the documents.
  • ngram_range: controls which n-grams are used as features: the unigram model (unigrams), the two-element model (bigrams), and the three-element model (trigrams).
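
A hedged sketch of how these parameters might be passed to TfidfVectorizer; the toy corpus and the parameter values are illustrative only, not taken from the original article:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]   # placeholder corpus

vectorizer = TfidfVectorizer(
    max_df=0.8,            # drop terms that appear in more than 80% of the documents
    min_df=2,              # keep only terms that appear in at least 2 documents
    ngram_range=(1, 2),    # use unigrams and bigrams as features
)
tfidf = vectorizer.fit_transform(docs)    # CountVectorizer + TfidfTransformer in one step
print(vectorizer.get_feature_names())     # renamed get_feature_names_out() in newer scikit-learn
print(tfidf.toarray())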

Reprint: https://blog.csdn.net/qq_30868235/article/details/80389180

 

1、First, give some example texts.

contents = [
    'I am Chinese.',
    'You are American.',
    "What's his name?",
    'Who is she?'
]

2、Using the CountVectorizer class

First, instantiate the CountVectorizer class, then use its fit_transform method to turn contents into a word-frequency matrix, i.e. a vectorized matrix.

The dense matrix is then obtained by calling todense() (or toarray()) on the vectorized result.

Finally, get the text keywords and their positions through vocabulary_.

from sklearn.feature_extraction.text import CountVectorizer

countVectorizer = CountVectorizer()
textVector = countVectorizer.fit_transform(contents)   # vectorized word-frequency matrix of the documents (sparse)

textVector.todense()            # dense view of the matrix
countVectorizer.vocabulary_     # mapping from each word to its column index

 

Setting the minimum document frequency and a token regular expression

By default the two steps above only keep tokens that are at least two characters long, but single-character Chinese words are also meaningful, so configure CountVectorizer accordingly using min_df and token_pattern.

countVectorizer = CountVectorizer(
    min_df=0,
    token_pattern=r"\b\w+\b")
textVector = countVectorizer.fit_transform(contents)

textVector.todense()
countVectorizer.vocabulary_

 

3、TF-IDF computation: calling TfidfTransformer

Import the TfidfTransformer class from sklearn.feature_extraction.text.

Pass in the word-frequency matrix; since the result of fit_transform is a sparse matrix, it needs to be converted with toarray() before the keywords can be extracted.

# Invoke the TF-IDF transformer and calculate the TF-IDF values

from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(textVector)   # pass in the word-frequency matrix

import pandas
TFIDFDataFrame = pandas.DataFrame(tfidf.toarray())
TFIDFDataFrame.columns = countVectorizer.get_feature_names()   # set the column names to the words

 

4、Keyword extraction

Use numpy.argsort(a, axis=1): it sorts the matrix a along the given axis and returns the indices that would sort it.

With axis=0 the elements are sorted down each column; with axis=1 they are sorted across each row.

The corresponding words can then be looked up from these position indices.

import numpy
TFIDFSorted = numpy.argsort(tfidf.toarray(), axis=1)[:, -2:]   # column indices of the 2 largest TF-IDF values per document
TFIDFDataFrame.columns[TFIDFSorted].values                     # look up the corresponding words
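
To make the indexing clearer, here is a tiny check, with toy numbers rather than the real TF-IDF values, of how argsort(..., axis=1)[:, -2:] picks the columns of the two largest values in each row:

import numpy

toy = numpy.array([[0.1, 0.5, 0.3],
                   [0.7, 0.2, 0.4]])
top2 = numpy.argsort(toy, axis=1)[:, -2:]   # column indices of the two largest values per row
print(top2)   # [[2 1]
              #  [2 0]]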

 

5、Practice exercise

1、Corpus building

import os
import os.path
import codecs

filePaths = []
fileContents = []
for root, dirs, files in os.walk(
    "D:\\PDM\\2.8\\SogouC.mini\\Sample"
):
    for name in files:
        filePath = os.path.join(root, name)
        filePaths.append(filePath)
        f = codecs.open(filePath, 'r', 'utf-8')
        fileContent = f.read()
        f.close()
        fileContents.append(fileContent)

import pandas
corpos = pandas.DataFrame({
    'filePath': filePaths, 
    'fileContent': fileContents
})

2、Chinese word segmentation

import re
zhPattern = re.compile(u'[\u4e00-\u9fa5]+')   # matches Chinese characters

import jieba
segments = []
filePaths = []

for index, row in corpos.iterrows():
    segments = []
    filePath = row["filePath"]
    fileContent = row["fileContent"]
    segs = jieba.cut(fileContent)
    for seg in segs:
        if zhPattern.search(seg):          # keep only Chinese tokens
            segments.append(seg)
    filePaths.append(filePath)
    # write back the space-joined segments so sklearn's vectorizers can consume them;
    # assign via .loc because modifying the row returned by iterrows() does not update the DataFrame
    corpos.loc[index, "fileContent"] = " ".join(segments)
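
For reference, a quick look at what jieba.cut produces for a single sentence; the segmentation shown in the comment is what the default dictionary typically returns and may vary slightly between jieba versions:

import jieba

print(list(jieba.cut("我来到北京清华大学")))
# typically: ['我', '来到', '北京', '清华大学']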

 

3、Removing stop words and calculating TF-IDF

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

stopwords = pandas.read_csv(
    "D:\\PDM\\2.8\\StopwordsCN.txt",
    encoding='utf8', 
    index_col=False,
    quoting=3,
    sep="\t"
)

countVectorizer = CountVectorizer(
    stop_words=list(stopwords['stopword'].values),   # unlike the earlier CountVectorizer, pass stop words so they are excluded from the counts
    min_df=0, token_pattern=r"\b\w+\b"
)
textVector = countVectorizer.fit_transform(
    corpos['fileContent']
)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(textVector)

4、Keyword extraction

import numpy
sort = numpy.argsort(tfidf.toarray(), axis=1)[:, -5:]
names = countVectorizer.get_feature_names()

keywords = pandas.Index(names)[sort].values

tagDF = pandas.DataFrame({
    'filePath': corpos.filePath, 
    'fileContent': corpos.fileContent, 
    'tag1': keywords[:, 0], 
    'tag2': keywords[:, 1], 
    'tag3': keywords[:, 2], 
    'tag4': keywords[:, 3], 
    'tag5': keywords[:, 4]
})

 

 

Recommending similar articles

After constructing the corpus and completing the task of word segmentation, the textVector is obtained.

 

from sklearn.metrics import pairwise_distances

distance_matrix = pairwise_distances(
    textVector,
    metric="cosine")   # cosine distances between the vectorized documents

m = 1 - pandas.DataFrame(distance_matrix)   # convert distances to similarities
m.columns = filePaths
m.index = filePaths

sort = numpy.argsort(distance_matrix, axis=1)[:, 1:6]   # 5 nearest documents, skipping the document itself
similarity5 = pandas.Index(filePaths)[sort].values

similarityDF = pandas.DataFrame({
    'filePath': corpos.filePath, 
    's1': similarity5[:, 0], 
    's2': similarity5[:, 1], 
    's3': similarity5[:, 2], 
    's4': similarity5[:, 3], 
    's5': similarity5[:, 4]})
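
As a sanity check on the 1 - distance_matrix step above: with metric="cosine", pairwise_distances returns 1 minus the cosine similarity, so subtracting it from 1 recovers the similarity. The toy vectors below are illustrative only:

import numpy
from sklearn.metrics import pairwise_distances

vectors = numpy.array([[1, 0, 1],
                       [1, 1, 0]], dtype=float)
dist = pairwise_distances(vectors, metric="cosine")
print(1 - dist)   # off-diagonal entries are the cosine similarities, here 0.5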

 

 

 

Automatic summary

Algorithm steps:

  • Get the article that needs a summary;
  • Compute word-frequency statistics for the article;
  • Split the article into clauses (for Chinese this is usually done on punctuation marks such as 。？！);
  • Compute the cosine similarity between each clause and the whole article;
  • Take the most similar clause as the summary of the article.

 

First, build the corpus, the stop-word list and the countVectorizer as in the practice exercise above.

Then build a sub-corpus for each article and its vectorization matrix:

contents = []
summarys = []
filePahts = []

for index, row in corpos.iterrows():
    filePath = row["filePath"]
    fileContent = row["fileContent"]
    # build a sub-corpus made up of the whole document plus its clauses
    subCorpos = [fileContent] + re.split(
        r'[。?!\n]\s*',
        fileContent
    )

    segments = []
    suitCorpos = []
    for content in subCorpos:
        segs = jieba.cut(content)
        segment = " ".join(segs)
        if len(segment.strip()) > 10:
            segments.append(segment)
            suitCorpos.append(content)

    textVector = countVectorizer.fit_transform(segments)

    distance_metrix = pairwise_distances(
        textVector,
        metric="cosine")

    sort = numpy.argsort(distance_metrix, axis=1)

    # row 0 holds the distances from the whole document; index 0 is the document itself,
    # so the next closest clause is taken as the summary
    summary = pandas.Index(suitCorpos)[sort[0]].values[1]

    summarys.append(summary)
    filePahts.append(filePath)
    contents.append(fileContent)
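
The snippet ends after the loop; presumably the collected lists are then assembled into a DataFrame in the same way as the keyword and similarity tables above. A hedged sketch, where the variable name summaryDF is hypothetical:

summaryDF = pandas.DataFrame({
    'filePath': filePahts,
    'fileContent': contents,
    'summary': summarys
})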

 
