Commit b619c19

Pretrain (PaddlePaddle#207)
* add textcnn_pretrain
* add textcnn_pretrain
* change classification to textcnn
* add readme in paddlerec
1 parent 3061d46 commit b619c19

22 files changed, +856 -15 lines changed

doc/pre_train_model.md

+20 -2
@@ -7,9 +7,27 @@ PaddleRec, based on business practice and real data, has produced recommendation-domain algorithms
 ### Download

 ```bash
-wget xxx.tar.gz
+wget https://paddlerec.bj.bcebos.com/textcnn_pretrain%2Fpretrain_model.tar.gz
 ```

 ### Usage

-After extraction you get a Paddle model folder; load it with the `PaddleRec/models/contentunderstanding/classification_finetue` model
+After extraction you get a Paddle model folder; load it with the `PaddleRec/models/contentunderstanding/textcnn` model
+The finetune_startup.py file is located in PaddleRec/models/contentunderstanding/textcnn_pretrain. Set the two parameters startup_class_path and init_pretraining_model_path in config.yaml.
+Set startup_class_path to the path of finetune_startup.py, and set init_pretraining_model_path to the parameter file you want to load.
+Taking textcnn_pretrain as an example, the configured runner looks like this:
+```
+runner:
+- name: train_runner
+  class: train
+  epochs: 6
+  device: cpu
+  save_checkpoint_interval: 1
+  save_checkpoint_path: "increment"
+  init_model_path: ""
+  print_interval: 10
+  startup_class_path: "{workspace}/finetune_startup.py"
+  init_pretraining_model_path: "{workspace}/pretrain_model/pretrain_model_params"
+  phases: phase_train
+```
+For details, see the textcnn guide [Finetuning with a pretrained model](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/contentunderstanding/textcnn_pretrain)
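
The two finetune-related runner keys both use the `{workspace}` placeholder. The sketch below only illustrates how such placeholders could expand into concrete paths; `resolve_runner_paths` is a hypothetical helper written for this note, not a PaddleRec function, and PaddleRec performs its own substitution internally. An empty value is kept empty, matching the config comment that an empty `init_pretraining_model_path` means finetuning is not used.

```python
def resolve_runner_paths(runner, workspace):
    """Expand "{workspace}" in the finetune-related keys of a runner dict.

    Hypothetical helper for illustration only; not part of PaddleRec.
    """
    resolved = dict(runner)
    for key in ("startup_class_path", "init_pretraining_model_path"):
        value = runner.get(key, "")
        # An empty string is left untouched: it means "feature not used".
        resolved[key] = value.replace("{workspace}", workspace) if value else ""
    return resolved


runner = {
    "name": "train_runner",
    "startup_class_path": "{workspace}/finetune_startup.py",
    "init_pretraining_model_path": "{workspace}/pretrain_model/pretrain_model_params",
}
paths = resolve_runner_paths(runner, "models/contentunderstanding/textcnn_pretrain")
print(paths["startup_class_path"])
print(paths["init_pretraining_model_path"])
```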

doc/yaml.md

+2
@@ -37,6 +37,8 @@
 | startup_class_path | string | path | No | Path to the custom startup implementation |
 | runner_class_path | string | path | No | Path to the custom runner implementation |
 | terminal_class_path | string | path | No | Path to the custom terminal implementation |
+| init_pretraining_model_path | string | path | No | Must be passed to the custom startup flow; path of the parameters to load for finetuning |
+

models/contentunderstanding/readme.md

+6 -6
@@ -1,7 +1,7 @@
 # Content Understanding Model Library

 ## Introduction
-We provide PaddleRec implementations of the models commonly used in content understanding tasks, together with single-machine training and inference accuracy metrics and distributed training and inference performance metrics. The implemented content understanding models include [Tagspace](tagspace), [text classification](classification), and others.
+We provide PaddleRec implementations of the models commonly used in content understanding tasks, together with single-machine training and inference accuracy metrics and distributed training and inference performance metrics. The implemented content understanding models include [Tagspace](tagspace), [text classification](textcnn), [textcnn-based pretrained model](textcnn_pretrain), and others.

 More models are being added to the library; stay tuned.

@@ -23,7 +23,7 @@
 | Model | Description | Paper |
 | :------------------: | :--------------------: | :---------: |
 | TagSpace | Tag recommendation | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
-| Classification | Text classification | [EMNLP 2014][Convolutional neural networks for sentence classification](https://www.aclweb.org/anthology/D14-1181.pdf) |
+| textcnn | Text classification | [EMNLP 2014][Convolutional neural networks for sentence classification](https://www.aclweb.org/anthology/D14-1181.pdf) |

 Below is a brief introduction to each model (note: the figures are taken from the linked papers)

@@ -32,7 +32,7 @@
 <img align="center" src="../../doc/imgs/tagspace.png">
 <p>

-[Text classification CNN model](https://www.aclweb.org/anthology/D14-1181.pdf)
+[textCNN model](https://www.aclweb.org/anthology/D14-1181.pdf)
 <p align="center">
 <img align="center" src="../../doc/imgs/cnn-ckim2014.png">
 <p>
@@ -42,7 +42,7 @@
 git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
 cd PaddleRec
 python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
-python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
+python -m paddlerec.run -m models/contentunderstanding/textcnn/config.yaml
 ```

 ## Tutorial (reproducing the papers)
@@ -134,7 +134,7 @@ batch: 13, acc: [0.928], loss: [0.01736144]
 batch: 14, acc: [0.93], loss: [0.01911209]
 ```

-**(2)Classification**
+**(2)textcnn**

 ### Data processing
 Sentiment classification (Senta for short) takes Chinese text that contains subjective descriptions, automatically determines its sentiment polarity, and returns the corresponding confidence. The sentiment is either positive or negative. Sentiment analysis helps companies understand consumer habits, analyze trending topics, and monitor public opinion during crises, providing useful decision support.
@@ -206,4 +206,4 @@ batch: 3, acc: [0.90234375], loss: [0.27907994]
 | Dataset | Model | loss | acc |
 | :------------------: | :--------------------: | :---------: |:---------: |
 | ag news dataset | TagSpace | 0.0198 | 0.9177 |
-| ChnSentiCorp | Classification | 0.2282 | 0.9127 |
+| ChnSentiCorp | textcnn | 0.2282 | 0.9127 |

models/contentunderstanding/classification/config.yaml → models/contentunderstanding/textcnn/config.yaml

+1 -1
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-workspace: "models/contentunderstanding/classification"
+workspace: "models/contentunderstanding/textcnn"

 dataset:
 - name: data1

models/contentunderstanding/classification/readme.md → models/contentunderstanding/textcnn/readme.md

+4 -5
@@ -1,11 +1,11 @@
-# classification text classification model
+# textcnn text classification model

 The directory structure of this example is outlined below:

 ```
 ├── data #sample data
 ├── train
-├── train.txt #sample training data
+├── train.txt #sample training data
 ├── test
 ├── test.txt #sample test data
 ├── preprocess.py #data preprocessing script
@@ -15,7 +15,6 @@
 ├── config.yaml #configuration file
 ├── reader.py #data reader
 ```
-
 Note: before reading this example, we recommend getting familiar with the following:
 [PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)

@@ -73,13 +72,13 @@ os : windows/linux/macos
 Sample data is provided so you can try the model quickly; run the following command directly from the paddlerec directory to start training:

 ```
-python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
+python -m paddlerec.run -m models/contentunderstanding/textcnn/config.yaml
 ```


 ## Reproducing the results
 To help users get every model running quickly, sample data is provided with each model. To reproduce the results reported in the readme, follow the steps below in order.
-1. Make sure your current directory is PaddleRec/models/contentunderstanding/classification
+1. Make sure your current directory is PaddleRec/models/contentunderstanding/textcnn
 2. Download and extract the dataset with the following commands:
 ```
 wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
@@ -0,0 +1,13 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
@@ -0,0 +1,118 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle.fluid as fluid
+from paddlerec.core.utils import envs
+from paddlerec.core.model import ModelBase
+from paddlerec.core.metrics import RecallK
+
+
+class Model(ModelBase):
+    def __init__(self, config):
+        ModelBase.__init__(self, config)
+        self.dict_size = 2000000 + 1
+        self.max_seq_len = 1024
+        self.emb_dim = 128
+        self.cnn_hid_dim = 128
+        self.cnn_win_size = 3
+        self.cnn_win_size2 = 5
+        self.hid_dim1 = 96
+        self.class_dim = 30
+        self.is_sparse = True
+
+    def input_data(self, is_infer=False, **kwargs):
+
+        text = fluid.data(
+            name="text", shape=[None, self.max_seq_len, 1], dtype='int64')
+        label = fluid.data(name="category", shape=[None, 1], dtype='int64')
+        seq_len = fluid.data(name="seq_len", shape=[None], dtype='int64')
+        return [text, label, seq_len]
+
+    def net(self, inputs, is_infer=False):
+        """ network definition """
+        #text label
+        self.data = inputs[0]
+        self.label = inputs[1]
+        self.seq_len = inputs[2]
+        emb = embedding(self.data, self.dict_size, self.emb_dim,
+                        self.is_sparse)
+        concat = multi_convs(emb, self.seq_len, self.cnn_hid_dim,
+                             self.cnn_win_size, self.cnn_win_size2)
+        self.fc_1 = full_connect(concat, self.hid_dim1)
+        self.metrics(is_infer)
+
+    def metrics(self, is_infer=False):
+        """ classification and metrics """
+        # softmax layer
+        prediction = fluid.layers.fc(input=[self.fc_1],
+                                     size=self.class_dim,
+                                     act="softmax",
+                                     name="pretrain_fc_1")
+        cost = fluid.layers.cross_entropy(input=prediction, label=self.label)
+        avg_cost = fluid.layers.mean(x=cost)
+        acc = fluid.layers.accuracy(input=prediction, label=self.label)
+        #acc = RecallK(input=prediction, label=label, k=1)
+
+        self._cost = avg_cost
+        if is_infer:
+            self._infer_results["acc"] = acc
+        else:
+            self._metrics["acc"] = acc
+
+
+def embedding(inputs, dict_size, emb_dim, is_sparse):
+    """ embedding definition """
+    emb = fluid.layers.embedding(
+        input=inputs,
+        size=[dict_size, emb_dim],
+        is_sparse=is_sparse,
+        param_attr=fluid.ParamAttr(
+            name='pretrain_word_embedding',
+            initializer=fluid.initializer.Xavier()))
+    return emb
+
+
+def multi_convs(input_layer, seq_len, cnn_hid_dim, cnn_win_size,
+                cnn_win_size2):
+    """conv and concat"""
+    emb = fluid.layers.sequence_unpad(
+        input_layer, length=seq_len, name="pretrain_unpad")
+    conv = fluid.nets.sequence_conv_pool(
+        param_attr=fluid.ParamAttr(name="pretrain_conv0_w"),
+        bias_attr=fluid.ParamAttr(name="pretrain_conv0_b"),
+        input=emb,
+        num_filters=cnn_hid_dim,
+        filter_size=cnn_win_size,
+        act="tanh",
+        pool_type="max")
+    conv2 = fluid.nets.sequence_conv_pool(
+        param_attr=fluid.ParamAttr(name="pretrain_conv1_w"),
+        bias_attr=fluid.ParamAttr(name="pretrain_conv1_b"),
+        input=emb,
+        num_filters=cnn_hid_dim,
+        filter_size=cnn_win_size2,
+        act="tanh",
+        pool_type="max")
+    concat = fluid.layers.concat(
+        input=[conv, conv2], axis=1, name="pretrain_concat")
+    return concat
+
+
+def full_connect(input_layer, hid_dim1):
+    """full connect layer"""
+    fc_1 = fluid.layers.fc(name="pretrain_fc_0",
+                           input=input_layer,
+                           size=hid_dim1,
+                           act="tanh")
+    return fc_1
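
The per-sample feature widths flowing through the network above can be traced with plain arithmetic. The sketch below is a reading aid rather than part of the commit: `textcnn_feature_sizes` is a hypothetical helper, and it assumes that each max-pooled sequence convolution yields one vector of `num_filters` values per sample, so concatenating the two branches along axis 1 doubles the width. The default values are the hyperparameters hard-coded in `Model.__init__`.

```python
def textcnn_feature_sizes(emb_dim=128, cnn_hid_dim=128, hid_dim1=96, class_dim=30):
    """Trace the per-sample feature width after each stage of the network.

    Hypothetical helper for illustration; defaults mirror Model.__init__.
    """
    conv = cnn_hid_dim        # window-size-3 branch after max pooling (assumed width)
    conv2 = cnn_hid_dim       # window-size-5 branch after max pooling (assumed width)
    concat = conv + conv2     # concat(axis=1) joins the two branches
    return {
        "embedding": emb_dim,   # width of each token vector
        "concat": concat,       # input width of pretrain_fc_0
        "fc": hid_dim1,         # output width of pretrain_fc_0
        "softmax": class_dim,   # output width of pretrain_fc_1
    }


sizes = textcnn_feature_sizes()
for stage, width in sizes.items():
    print(stage, width)
```

Note that the filter window sizes (3 and 5) change how many tokens each filter sees, not the width of the pooled output, which is why both branches contribute `cnn_hid_dim` features.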
@@ -0,0 +1,70 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+workspace: "models/contentunderstanding/textcnn_pretrain"
+
+dataset:
+- name: dataset_train
+  batch_size: 128
+  type: DataLoader
+  data_path: "{workspace}/senta_data/train"
+  data_converter: "{workspace}/reader.py"
+- name: dataset_infer
+  batch_size: 256
+  type: DataLoader
+  data_path: "{workspace}/senta_data/test"
+  data_converter: "{workspace}/reader.py"
+
+hyper_parameters:
+  optimizer:
+    class: adam
+    learning_rate: 0.001
+    strategy: async
+
+mode: [train_runner,infer_runner]
+
+runner:
+- name: train_runner
+  class: train
+  epochs: 6
+  device: cpu
+  save_checkpoint_interval: 1
+  save_checkpoint_path: "increment"
+  init_model_path: ""
+  print_interval: 10
+  # startup class for finetuning
+  startup_class_path: "{workspace}/finetune_startup.py"
+  # path of pretrained model. Please set empty if you don't use finetune function.
+  init_pretraining_model_path: "{workspace}/pretrain_model/pretrain_model_params"
+
+  phases: phase_train
+
+- name: infer_runner
+  class: infer
+  # device to run training or infer
+  device: cpu
+  print_interval: 1
+  init_model_path: "increment/3" # load model path
+  phases: phase_infer
+
+
+phase:
+- name: phase_train
+  model: "{workspace}/model.py"
+  dataset_name: dataset_train
+  thread_num: 1
+- name: phase_infer
+  model: "{workspace}/model.py" # user-defined model
+  dataset_name: dataset_infer # select dataset by name
+  thread_num: 1
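
The config above links its sections by name: `mode` lists runners, each runner's `phases` names a phase, and each phase's `dataset_name` names a dataset. A minimal sketch of checking those links is shown below. `check_references` is a hypothetical helper, not a PaddleRec API, and the dict is a hand-copied subset of the YAML so the example needs no YAML parser.

```python
def check_references(config):
    """Return a list of dangling name references in a PaddleRec-style config dict.

    Hypothetical helper for illustration only; not part of PaddleRec.
    """
    def as_list(value):
        # The YAML allows a single name ("phase_train") or a list of names.
        return [value] if isinstance(value, str) else list(value or [])

    runners = {r["name"]: r for r in config.get("runner", [])}
    phases = {p["name"]: p for p in config.get("phase", [])}
    datasets = {d["name"] for d in config.get("dataset", [])}

    problems = []
    for mode in as_list(config.get("mode")):
        if mode not in runners:
            problems.append("mode -> unknown runner: " + mode)
    for runner in runners.values():
        for phase_name in as_list(runner.get("phases")):
            if phase_name not in phases:
                problems.append(runner["name"] + " -> unknown phase: " + phase_name)
    for phase in phases.values():
        dataset_name = phase.get("dataset_name")
        if dataset_name not in datasets:
            problems.append(phase["name"] + " -> unknown dataset: " + str(dataset_name))
    return problems


config = {
    "dataset": [{"name": "dataset_train"}, {"name": "dataset_infer"}],
    "mode": ["train_runner", "infer_runner"],
    "runner": [
        {"name": "train_runner", "class": "train", "phases": "phase_train"},
        {"name": "infer_runner", "class": "infer", "phases": "phase_infer"},
    ],
    "phase": [
        {"name": "phase_train", "dataset_name": "dataset_train"},
        {"name": "phase_infer", "dataset_name": "dataset_infer"},
    ],
}
print(check_references(config))
```

Returning a list of messages instead of raising on the first problem lets a caller report every broken link in one pass.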
