This project is designed to process and code address data from the CBDB address code table. The main script, code_addr.py, reads input data, processes it, and outputs coded address information.
code_addr.py: Main script for processing and coding address data.
addr_data_schema.xlsx: Schema for address data.
ADDRESSES.txt: Processed address data.
cbdb_entity_address_types.csv: List of address types.
input_small.txt: Small input dataset for testing.
input.txt: Main input dataset.
output.txt: Output file containing coded address data.
ZZZ_ADDRESSES.xlsx: Original address data in Excel format.
Install Dependencies
Ensure the required dependencies are installed. Use the following command to install them:
    pip install pandas char-converter
Load Your Input Data
Prepare your input data based on the following schema:

id    dy    addr_name     addr_belong    time
1     宋    甌寧          建州           1279
2     清    江南太平府    no_info        no_info

Save the input data in input.txt.
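Before running the script, it can help to confirm that an input file matches this schema. The following is a minimal sketch, not part of code_addr.py: the helper name `parse_input` is hypothetical, the tab delimiter is an assumption (adjust it to whatever your file uses), and it relies only on the standard library rather than pandas. It also treats the `no_info` placeholder from the sample rows as a missing value.

```python
import csv
import io

# Column names taken from the schema above.
EXPECTED_COLUMNS = ["id", "dy", "addr_name", "addr_belong", "time"]
MISSING = "no_info"  # placeholder used in the sample rows for unknown values


def parse_input(text, delimiter="\t"):
    """Parse input text into a list of dicts, mapping 'no_info' to None.

    Assumption: fields are tab-separated; pass another delimiter if needed.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for row in reader:
        rows.append({k: (None if v == MISSING else v) for k, v in row.items()})
    return rows


# The two sample rows from the schema, written as tab-separated text.
sample = (
    "id\tdy\taddr_name\taddr_belong\ttime\n"
    "1\t宋\t甌寧\t建州\t1279\n"
    "2\t清\t江南太平府\tno_info\tno_info\n"
)
rows = parse_input(sample)
print(rows[1]["addr_belong"])  # None
```

To check a real file, read it with `open("input.txt", encoding="utf-8").read()` and pass the result to `parse_input`.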
Run the Script
Execute the script to process the input data and generate the output:
    python code_addr.py
To convert variants to simplified Chinese as part of a standardization step, modify the script:
Change
    use_char_converter = False
to
    use_char_converter = True
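As a rough illustration of what such a flag-controlled standardization step could look like, here is a minimal sketch. It does not use the char-converter package and does not reproduce the logic in code_addr.py: the two-entry mapping `VARIANT_TO_SIMPLIFIED` and the function `standardize` are hypothetical stand-ins, and only the flag name `use_char_converter` comes from the script.

```python
# Stand-in mapping for illustration only. The real script relies on the
# char-converter package, whose API is not shown here.
VARIANT_TO_SIMPLIFIED = {"甌": "瓯", "寧": "宁"}

use_char_converter = True  # same flag name as in code_addr.py


def standardize(name, enabled=use_char_converter):
    """Return `name` with mapped characters replaced when the flag is enabled."""
    if not enabled:
        return name
    # Characters without a mapping entry are passed through unchanged.
    return "".join(VARIANT_TO_SIMPLIFIED.get(ch, ch) for ch in name)


print(standardize("甌寧"))                 # 瓯宁
print(standardize("甌寧", enabled=False))  # 甌寧
```

Keeping the conversion behind a single boolean means the same input can be coded with or without standardization, which makes it easy to compare the two outputs.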
The script processes address data by reading from ZZZ_ADDRESSES.xlsx. You can download the latest version of ZZZ_ADDRESSES.xlsx from CBDB on Hugging Face.
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.