Tags: xgboost

Author: self refiner

——

Welcome to visit my Jianshu page as well as my blog.

**All the content of this blog is mainly for study, research and sharing. If you would like to reprint it, please contact me, credit the author and source, and use it for non-commercial purposes only. Thank you.**

——

If the formatting here looks awkward, you can read this post on my Jianshu page, where the markdown renders well.

## 1. Abstract

- xgboost is a great algorithm. For classification problems it is basically the first thing to try, because it works very well. The algorithm was developed by Chen Tianqi, a leading expert in the field. I will not say much about its principles here.
- This article mainly covers how to use the xgboost algorithm in practice. There are two ways to do so: the first is through the sklearn-style interface (`XGBClassifier`), and the second is to call the xgboost package's own API directly.

## (1) Using the sklearn-style interface of xgboost for text classification

- Step one: convert the text into TF-IDF vectors.

This involves text preprocessing, which has many steps but follows a fixed routine: removing stop words, stripping specified odd symbols, word segmentation, and so on. For details, see the article "Preprocessing of Chinese text". For how to use the sklearn library to convert text into TF-IDF vectors, see the article "Using different methods to calculate TF-IDF value".
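As a rough illustration of step one, the following sketch uses sklearn's `TfidfVectorizer`. The corpus is made up, and it assumes the documents have already been preprocessed and segmented into space-separated tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: already cleaned and segmented into space-separated tokens.
docs = [
    "xgboost text classification",
    "text preprocessing stop words",
    "tfidf vector text",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(X.shape)                 # (number of documents, vocabulary size)
print(vectorizer.vocabulary_)  # mapping from token to column index
```

Note that the default token pattern of `TfidfVectorizer` drops single-character tokens, which matters for segmented Chinese text; passing a custom `token_pattern` or `tokenizer` is one way around that.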

- Step two: call xgboost through its sklearn-style interface.

```
from xgboost import XGBClassifier
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)
```

`X_train` and `y_train` follow the usual sklearn data format, which is not repeated here. In fact, the TF-IDF matrix computed with sklearn can be passed in directly.

## (2) Calling the xgboost package directly

Let's first look at the format of the TF-IDF values produced by gensim. Each document is a list of (feature id, weight) pairs.

```
[[(0, 0.33699829595119235),
(1, 0.8119707171924228),
(2, 0.33699829595119235),
(4, 0.33699829595119235)],
[(0, 0.10212329019650272),
(2, 0.10212329019650272),
(4, 0.10212329019650272),
(5, 0.9842319344536239)],
[(6, 0.5773502691896258), (7, 0.5773502691896258), (8, 0.5773502691896258)],
[(0, 0.33699829595119235),
(1, 0.8119707171924228),
(2, 0.33699829595119235),
(4, 0.33699829595119235)]]
```

Next, we convert the TF-IDF vectors produced by gensim into the data format we need.

Here is some example code you can adapt. It writes only the features; the labels can be written in a similar way.

```
a = [[(0, 0.33699829595119235),
(1, 0.8119707171924228),
(2, 0.33699829595119235),
(4, 0.33699829595119235)],
[(0, 0.10212329019650272),
(2, 0.10212329019650272),
(4, 0.10212329019650272),
(5, 0.9842319344536239)],
[(6, 0.5773502691896258), (7, 0.5773502691896258), (8, 0.5773502691896258)],
[(0, 0.33699829595119235),
(1, 0.8119707171924228),
(2, 0.33699829595119235),
(4, 0.33699829595119235)]]
# Write each (feature_id, weight) pair as "id:weight", one pair per line
with open('test.txt', 'w', encoding='utf-8') as fw:
    for i in range(len(a)):
        for j in range(len(a[i])):
            fw.write(str(a[i][j][0]) + ":" + str(a[i][j][1]) + '\n')
```

Contents of `test.txt` after running the code above:

```
0:0.33699829595119235
1:0.8119707171924228
2:0.33699829595119235
4:0.33699829595119235
0:0.10212329019650272
2:0.10212329019650272
4:0.10212329019650272
5:0.9842319344536239
6:0.5773502691896258
7:0.5773502691896258
8:0.5773502691896258
0:0.33699829595119235
1:0.8119707171924228
2:0.33699829595119235
4:0.33699829595119235
```
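The text files that xgboost loads are normally in LIBSVM format, where each document occupies one line of the form `<label> <id>:<value> <id>:<value> ...`. The sketch below shows one way to write the labels together with the features in that layout. The labels and the helper name `to_libsvm_line` are hypothetical, and the weights are shortened for readability.

```python
# gensim-style TF-IDF: one list of (feature_id, weight) pairs per document
docs = [
    [(0, 0.337), (1, 0.812), (2, 0.337), (4, 0.337)],
    [(0, 0.102), (2, 0.102), (4, 0.102), (5, 0.984)],
    [(6, 0.577), (7, 0.577), (8, 0.577)],
]
labels = [1, 0, 1]  # hypothetical class labels, one per document


def to_libsvm_line(label, pairs):
    """Format one document as '<label> <id>:<value> <id>:<value> ...'."""
    features = " ".join(f"{idx}:{val}" for idx, val in sorted(pairs))
    return f"{label} {features}"


lines = [to_libsvm_line(y, doc) for y, doc in zip(labels, docs)]

# One document per line, label first.
with open("temp_train.txt", "w", encoding="utf-8") as fw:
    fw.write("\n".join(lines) + "\n")

print(lines[2])  # 1 6:0.577 7:0.577 8:0.577
```

As far as I know, recent xgboost releases also expect the file format to be spelled out when loading a text file, for example `temp_train.txt?format=libsvm`; check the documentation for the version you have installed.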

Once the data is converted, we can get started.

#### A. Import the package

`import xgboost as xgb`

#### B. Convert the data with DMatrix, XGBoost's own data matrix class

- `temp_train.txt` and `temp_test.txt` are the data files that we converted earlier.

```
dtrain = xgb.DMatrix('temp_train.txt')
dtest = xgb.DMatrix('temp_test.txt')
```

#### C. Train and save the model

- Model parameters

```
param = {'max_depth':2,'eta':1,'silent':0,'objective':'binary:logistic'}
num_round = 2
```

- Train the model and save it

```
bst = xgb.train(param,dtrain,num_round)
bst.save_model('xgboost.model')
```

- Predict the labels (xgboost returns the predicted probability, so we need to convert it into a label).

`preds = bst.predict(dtest)` returns, for each sample, the predicted probability of the positive class (label 1). `p_label = [round(value) for value in preds]` then converts these probabilities into predicted labels.
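The conversion from probabilities to labels can be sketched in isolation. The list below is a made-up stand-in for the output of `bst.predict(dtest)`, and the 0.5 threshold is an assumption that matches what `round` does for values between 0 and 1.

```python
# Stand-in for bst.predict(dtest): probabilities of the positive class (made-up values).
preds = [0.91, 0.12, 0.5, 0.49, 0.73]

# An explicit threshold makes the decision rule visible and easy to tune.
threshold = 0.5
p_label = [1 if p > threshold else 0 for p in preds]

print(p_label)  # [1, 0, 0, 0, 1]
```

In Python 3, `round(0.5)` evaluates to `0`, so the explicit rule `p > 0.5` gives the same labels as `round` here, while also allowing a different threshold when the classes are imbalanced.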

## 3. Summary

xgboost is a good algorithm. In classification competitions, people usually run xgboost first to see how it performs, so I am sharing how to use it here. I hope it is helpful to everyone.