## News
- [June 13th, 2024] FlagData v3.0.0 update: supports multiple data types, dozens of operator pools for DIY, and generates high-quality data with one click
- [Dec 31st, 2023] FlagData v2.0.0 has been upgraded
- [Jan 31st, 2023] FlagData v1.0.0 is online!
- [Strong community support](#Strong-community-support)
- [Users](#Users)
- [Reference project](#Reference-project)
- [License](#License)
# V3.0.0 UPDATE
Based on feedback from the community, FlagData has been upgraded. This update provides a set of easy-to-use language pre-training data construction tools. For different data types, we provide one-click data quality improvement tasks for Html, Text, Book, Arxiv, Qa, etc. Both novice and advanced users can easily generate high-quality data.
- Novice users: Just confirm the data type to generate high-quality data.
- Advanced users: We provide dozens of operator pools so users can DIY their own LLM pre-training data construction process.
**Project Features:**
- Ease of use: Simple operation; a simple configuration is all that is needed to generate high-quality data.
- Flexibility: Advanced users can customize the data construction process through various operator pools.
Image, Formula, etc. The tool scripts provide two output forms: keeping the full text, or saving the parsed content by category.
See [Instructions for using all2txt modules](flagdata/all2txt/README.md) for an example.
### Data preprocessing phase
finally outputs a score between 0 and 1. Under the general cleaning rules, if the score is greater than 0.5, the page is classified as that specific language; otherwise, the language of the page cannot be determined and the page is discarded.
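The thresholding rule above can be sketched as follows. This is only an illustration of the 0.5 decision rule; `keep_page` and its input format are hypothetical names, not the module's actual API (the real classifier is a fastText language-ID model):

```python
def keep_page(lang_scores, threshold=0.5):
    """Apply the 0.5 rule to a language classifier's output.

    lang_scores maps language labels to confidence scores in [0, 1],
    e.g. the top predictions of a fastText language-ID model.
    """
    label, score = max(lang_scores.items(), key=lambda kv: kv[1])
    if score > threshold:
        return label  # confident enough: classify the page as this language
    return None       # uncertain: discard the page

print(keep_page({"zh": 0.92, "en": 0.05}))  # zh
print(keep_page({"zh": 0.42, "en": 0.40}))  # None
```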
See [Instructions for using the language identification module](flagdata/language_identification/README.md) for an example.
#### Data cleaning
We provide one-click data quality improvement tasks for Html, Text, Book, Arxiv, Qa, etc. For more customized functions, users can refer to the "data_operator" section.
##### TextCleaner
TextCleaner provides a fast and extensible text data cleaning tool with commonly used text cleaning modules.
Users only need to select the text_clean.yaml file in cleaner_builder.py to process text data.
For details, see [Instructions for using TextCleaner](flagdata/cleaner/docs/Text_Cleaner.md).
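As an illustration of the kind of rules such a cleaner applies — the function below is an illustrative stand-in, not the actual TextCleaner implementation, whose rule set is configured in text_clean.yaml:

```python
import re

def basic_text_clean(text: str) -> str:
    """Toy cleaning pass: drop URLs, collapse repeated punctuation,
    and normalize whitespace/newlines."""
    text = re.sub(r"https?://\S+", "", text)      # drop URLs
    text = re.sub(r"([!?！？]){2,}", r"\1", text)  # collapse repeated punctuation
    text = re.sub(r"\s+", " ", text).strip()      # normalize whitespace
    return text

print(basic_text_clean("see https://example.com   now!!!"))  # see now!
```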
##### ArxivCleaner
ArxivCleaner provides a commonly used arxiv text data cleaning tool.
Users only need to select the arxiv_clean.yaml file in cleaner_builder.py to process arxiv data.
##### HtmlCleaner
HtmlCleaner provides commonly used Html format text extraction and data cleaning tools.
Users only need to run the main method to process Html data.
##### QaCleaner
QaCleaner provides commonly used Qa format text extraction and data cleaning tools.
Users only need to run the main method to process Qa data.
For details, see [Instructions for using Qa](flagdata/cleaner/docs/Qa_Cleaner.md).
##### BookCleaner
BookCleaner provides a common book format text extraction and data cleaning tool.
Users only need to run the main method to process the book data.
For details, see [Instructions for using Book](flagdata/cleaner/docs/Book_Cleaner.md).
#### Quality assessment
This paper compares different text classification models, including logistic regression, FastText, and BERT, to evaluate their performance. In the experiments, the BERTEval and FastText models perform well in text classification tasks, with the FastText model performing best in terms of accuracy and recall. [Experimental results are from ChineseWebText]
See [Instructions for using the quality assessment module](flagdata/quality_assessment/README.md) for an example.
#### Data deduplication
to retain only those texts that are very similar, while discarding those texts with slight differences. The default value is 0.87. At the same time, we use the distributed computing power of Spark to deal with large-scale data: the idea of MapReduce is used to remove duplicates, and the job is tuned in Spark to handle large-scale text data sets efficiently.
The following is similar text encountered during deduplication; it has slight differences in line wrapping and name editing, but the deduplication algorithm can identify the two paragraphs of text as highly similar.
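A minimal single-machine sketch of this idea, using exact Jaccard similarity over word sets in place of MinHash estimates on Spark (the function names are illustrative, not the module's API; the real pipeline is distributed and far more efficient than this O(n²) loop):

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity over word sets (a stand-in for MinHash estimates)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def dedup(texts, threshold=0.87):
    """Greedy near-duplicate removal: keep a text only if it is not
    >= threshold similar to any text already kept."""
    kept = []
    for t in texts:
        if all(jaccard(t, k) < threshold for k in kept):
            kept.append(t)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",  # differs only in spacing
    "an entirely different sentence about data",
]
print(len(dedup(docs)))  # 2
```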
The analysis module provides the following functions:

+ length analysis of the text.
See [Instructions for using the analysis module](flagdata/analysis/README.md) for an example.
## Configuration
For the `data cleaning` and `data quality assessment` modules, the configuration file is in readable [YAML](https://yaml.org) format and provides detailed comments. Please make sure that the parameters have been modified in the configuration file before using these modules.
Here are some important parameters you need to pay attention to:
### Data cleaning
```yaml
# Raw data to be cleaned
input: ./demo/demo_input.jsonl
# Save path of the cleaned data
output: ./demo/output.jsonl
# Field to be processed
source_key: text
# Key in the output file for saving the cleaned result
result_key: cleanedContent
# Pipeline class to select
cleaner_class: ArxivCleaner
```
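As a sketch of how `source_key` and `result_key` are interpreted, each JSONL record's `source_key` field is read, cleaned, and stored under `result_key`. The `process_line` helper and the trivial `clean_fn` here are hypothetical stand-ins for the real cleaner, not FlagData's API:

```python
import json

# Mirrors the YAML above: read source_key from each JSONL record and
# store the cleaned text under result_key.
config = {"source_key": "text", "result_key": "cleanedContent"}

def process_line(line: str, clean_fn=str.strip) -> str:
    record = json.loads(line)
    record[config["result_key"]] = clean_fn(record[config["source_key"]])
    return json.dumps(record, ensure_ascii=False)

print(process_line('{"text": "  raw text  "}'))
```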
### Data Quality assessment
```yaml
# The text_key field is the field being evaluated
text_key: "raw_content"
```
## Operator Pool
We provide some basic operators for data cleaning, filtering, format conversion, etc. to help users build their own data construction process.
The operators provided are divided into three types: Formatter, Pruner, and Filter. Formatter is used to process structured data and can be used for mutual conversion of data in different formats; Pruner is used to clean text data; Filter is used for sample filtering.
The figure below shows these operators at different processing stages, together with a list of some of the operators.

For a detailed description, see [Instructions for using the data operator](flagdata/data_operator/Operator_ZH.md).
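The Pruner and Filter roles described above can be sketched as simple callables: a Pruner rewrites a sample, a Filter decides whether to keep it. All names here are hypothetical illustrations, not the actual operator API:

```python
from typing import Callable, Iterable, Iterator, List

def pruner_normalize_ws(text: str) -> str:
    """A toy Pruner: collapse runs of whitespace."""
    return " ".join(text.split())

def make_filter_min_len(min_len: int) -> Callable[[str], bool]:
    """A toy Filter factory: keep samples with at least min_len characters."""
    return lambda t: len(t) >= min_len

def run_pipeline(samples: Iterable[str],
                 pruners: List[Callable[[str], str]],
                 filters: List[Callable[[str], bool]]) -> Iterator[str]:
    # Apply every Pruner to each sample, then keep it only if all Filters pass.
    for s in samples:
        for prune in pruners:
            s = prune(s)
        if all(f(s) for f in filters):
            yield s

out = list(run_pipeline(["  too short ", "a sample that is long enough to keep"],
                        [pruner_normalize_ws], [make_filter_min_len(20)]))
print(out)  # ['a sample that is long enough to keep']
```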
## Strong community support
### Community Support
If you have any questions about the use or code of this project, you can submit an issue. You can also contact us directly via email at [email protected].
An active community cannot be separated from your contribution. If you have a new idea, welcome to join our community, let us become part of open source, and contribute to open source together!
Or follow the FlagOpen open source system; FlagOpen official website: https://flagopen.baai.ac.cn/
### Questions and Feedback
- Please report issues and make suggestions through GitHub Issues; we will respond quickly, within 24 hours.
- You are also welcome to discuss actively in GitHub Discussions.
- If it is inconvenient to use GitHub, everyone in the FlagData open source community is of course also welcome to speak freely; for reasonable suggestions, we will iterate in the next version.
We will regularly invite experts in the field to hold online and offline exchanges to share the latest LLM research results.