Article from: https://www.cnblogs.com/U940634/p/9735946.html

word frequency: the number of times a word occurs in a document.
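As a minimal illustration (a toy sketch, not taken from the original article), the raw word frequency of a single text can be computed with Python's standard library:

from collections import Counter

text="to be or not to be"
freq=Counter(text.split())   #count each whitespace-separated token
print(freq["to"])            #2: "to" occurs twice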

 

1、Corpus building

import jieba
jieba.load_userdict("D:\\Python\\PythonData mining \\Python data mining practical course courseware \\2.2\\ Jin Yong Wugong style.txt")   #load a custom dictionary of Jin Yong martial-arts terms
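load_userdict() teaches jieba domain-specific terms (here, Jin Yong martial-arts vocabulary) so they are not split apart during segmentation. jieba's user dictionary format puts one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. A hypothetical two-line file (the article does not show the real contents):

降龙十八掌 3 n
九阳神功 3 n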

import os
import os.path
import codecs

filePaths=[]
fileContents=[]
for root,dirs,files in os.walk("D:\\Python\\PythonData mining \\Python data mining practical course courseware \\2.2\\SogouC.mini\\Sample"):
    for name in files:
        filePath=os.path.join(root,name)
        filePaths.append(filePath)
        f=codecs.open(filePath,"r","utf-8")
        fileContent=f.read()
        f.close()
        fileContents.append(fileContent)
        
import pandas
corpos=pandas.DataFrame({
                         "filePath":filePaths,
                         "fileContent":fileContents})

#Record which file each segmented word comes from
import jieba

segments=[]
filePaths=[]
for index,row in corpos.iterrows():
    filePath=row["filePath"]
    fileContent=row["fileContent"]
    segs=jieba.cut(fileContent)
    for seg in segs:
        segments.append(seg)
        filePaths.append(filePath)
        
segmentDataFrame=pandas.DataFrame({
                                   "segment":segments,
                                   "filePath":filePaths})

 

2、word frequency count

import numpy
#Word frequency statistics
#by= names the grouping column; ["segment"] selects the column to aggregate.
segStat=segmentDataFrame.groupby(
            by="segment"
            )["segment"].agg(
            count=numpy.size
            ).reset_index().sort_values(   #reset the index, then sort by count in descending order
            by="count",
            ascending=False)

by="segment" names the column whose values define the groups; the ["segment"] selection that follows picks the column to aggregate.

The second selection is the column being counted within each group, and it may be the grouping column itself.
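As a toy demonstration of the same pattern (hypothetical data, not from the corpus):

import pandas
toy=pandas.DataFrame({"segment":["a","b","a"]})
print(toy.groupby(by="segment")["segment"].agg(count="size").reset_index())
#  segment  count
#0       a      2
#1       b      1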

 

 

3、Remove stop words. Many of the most frequent tokens are function words and punctuation that carry no meaning, so we filter them out of the statistics.

stopwords=pandas.read_csv(
    "D:\\Python\\PythonData mining \\Python data mining practical course courseware \\2.3\\StopwordsCN.txt",   #the stop-word file; read_csv exposes it as a "stopword" column
    encoding="utf-8",
    index_col=False)

fSegStat=segStat[
        ~segStat.segment.isin(stopwords.stopword)]

The filtering uses isin() to test whether each segment appears in the stop-word list, and ~ negates the mask so that only non-stop-words are kept.
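A small sketch of the same filtering pattern on made-up data:

words=pandas.Series(["the","kungfu","a"])
stop=pandas.Series(["the","a"])
print(words[~words.isin(stop)])   #keeps only "kungfu"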

 

 

Second segmentation method (filtering stop words during segmentation):

import jieba

segments=[]
filePaths=[]

for index,row in corpos.iterrows():
    filePath=row["filePath"]
    fileContent=row["fileContent"]
    segs=jieba.cut(fileContent)
    for seg in segs:
        if seg not in stopwords.stopword.values and len(seg.strip())>0:
            segments.append(seg)
            filePaths.append(filePath)

segmentDataFrame=pandas.DataFrame({
        "segment":segments,
        "filePath":filePaths})

segStat=segmentDataFrame.groupby(
                    by="segment"
                    )["segment"].agg(
                    count=numpy.size
                    ).reset_index().sort_values(
                        by="count",
                        ascending=False)

 

The second method filters out stop words with an if test immediately after jieba segmentation, so they never enter the DataFrame, and then runs the same grouped count.
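To sanity-check the result, print the head of the sorted table (a usage sketch; the actual words depend on your corpus):

print(segStat.head(10))   #the ten most frequent non-stop-word segments, since segStat is sorted by count in descending order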

 
