PyChildes is a Python-based preprocessing toolkit designed for working with the CHILDES corpus. This tool is currently under active development to provide streamlined and efficient methods for processing and analyzing CHILDES data.
The CHILDES (Child Language Data Exchange System) corpus is a part of the larger TalkBank project, providing an invaluable resource for researchers studying child language acquisition. PyChildes aims to simplify the handling of these datasets by offering robust tools for preprocessing, transforming, and analyzing CHAT-formatted data files.
To learn more about the CHAT Transcription Format, please refer to the official documentation.
When citing the use of TalkBank and CHILDES facilities, please use this reference to the last printed version of the CHILDES manual:
@book{macwhinney2000childes,
author = {MacWhinney, Brian},
title = {The CHILDES Project: Tools for Analyzing Talk},
edition = {3rd},
year = {2000},
publisher = {Lawrence Erlbaum Associates},
address = {Mahwah, NJ}
}
This tool is a part of the broader Trabank toolkit that we are actively developing. If you find this tool useful, please give us credit by citing:
@misc{trabank,
title={Trabank: A Toolkit for Computational Developmental Studies in Language Models},
author={Ma, Ziqiao and Chai, Joyce and Shi, Freda},
howpublished={https://github.com/Mars-tin/PyChildes},
year={2025}
}
We use the Google docstring format and the pre-commit library to check our code. To install pre-commit and set up its hooks, run the following commands:
conda install pre-commit # or pip install pre-commit
pre-commit install
The pre-commit hooks will run automatically when you try to commit changes to the repository.
Start by creating two sub-directories, raw/ and prep/, in data/.
mkdir raw
mkdir prep
mkdir raw/childes
mkdir prep/childes
- Download the Eng-NA and Eng-UK collections of the CHILDES corpora, or run the following commands.
  cd raw/
  wget https://childes.talkbank.org/access/Eng-NA/0-Eng-NA-MOR.zip
  wget https://childes.talkbank.org/access/Eng-UK/0-Eng-UK-MOR.zip
- Unzip the files and organize them in the raw/ directory as follows.
.
├── childes
│   ├── ENG-NA
│   │   └── ...
│   └── ENG-UK
│       ├── ...
│       └── Wells
│           ├── ...
│           └── Tony
│               ├── 010526.cha
│               ├── 010826.cha
│               ├── 011114.cha
│               ├── 020310.cha
│               ├── 020526.cha
│               ├── 020902.cha
│               ├── 021123.cha
│               ├── 030321.cha
│               └── 030608.cha
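Once the files are in place, it can help to confirm that the transcripts are where later steps expect them. The following is a minimal sketch, not part of PyChildes: `list_cha_files` is a hypothetical helper that walks a directory and collects every `.cha` file beneath it.

```python
from pathlib import Path


def list_cha_files(root):
    """Return every .cha transcript found anywhere under root, sorted by path."""
    return sorted(Path(root).rglob("*.cha"))


# Example (assumes the layout above and that the working directory is data/):
#     for path in list_cha_files("raw/childes"):
#         print(path)
```

Because the search is recursive, the helper does not depend on how deeply a given collection nests its sessions.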
The script prepare_childes.py is designed to preprocess .cha files from the CHILDES corpus based on a specified configuration file.
Configuration files (.yaml) defining the preprocessing rules and settings are located under configs/.
- Purpose: Manage header lines, i.e. metadata lines starting with @ that contain session and participant information.
- Options:
  - keep_data (bool): whether to retain header lines (default: false).
- Purpose: Control the processing of speaker utterances (main lines starting with *).
- Subcomponents:
  - keep_data and keep_speaker: retain or remove utterances and speaker tags.
  - interposed, nonverbal: handle interposed comments and silent actions.
  - basic: handle markers like satellite (‡), tone (↑/↓), pauses, etc.
  - linkers: manage special termination markers like trail-offs (+...), interruptions (+/), latching (++), and more.
  - incomplete: handle incomplete or omitted words.
  - specform: process special word forms (@b, @c, @d, etc.) like babbling, dialect, neologisms.
  - unidentifiable: tag unintelligible, phonological, and untranscribed material (xxx, yyy, www).
  - disfluency: manage speech disfluencies like fragments (&+), fillers (&-), nonwords (&~).
  - scoped: process scoped annotations like paralinguistic events ([^]), replacements ([:]), stressing ([!]), retracing ([//]), etc.
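To make a few of these categories concrete, here is a small sketch of one possible cleanup policy for the text of a main line. It is an illustration only and not the PyChildes implementation: `clean_utterance` is a hypothetical function, it assumes the chosen policy is to drop unidentifiable and disfluency tokens entirely, and it only strips simple letter-only special form markers such as @b or @c.

```python
import re

# Tokens for unintelligible, phonologically coded, and untranscribed material.
UNIDENTIFIABLE = {"xxx", "yyy", "www"}
# Prefixes that mark fragments, fillers, and nonwords.
DISFLUENCY_PREFIXES = ("&+", "&-", "&~")
# A simple special form marker at the end of a word, e.g. "@b", "@c", "@d".
SPECFORM = re.compile(r"@[a-z]+$")


def clean_utterance(text):
    """Apply one possible (assumed) cleanup policy to utterance text."""
    kept = []
    for token in text.split():
        if token in UNIDENTIFIABLE:
            continue  # drop unidentifiable material
        if token.startswith(DISFLUENCY_PREFIXES):
            continue  # drop fragments, fillers, and nonwords
        kept.append(SPECFORM.sub("", token))  # strip a trailing special form marker
    return " ".join(kept)


print(clean_utterance("&-um I want xxx the wawa@c ."))  # prints: I want the wawa .
```

A real configuration may instead replace such tokens with tags or keep them unchanged; the point here is only that each category can be recognized from its surface form.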
- Purpose: Manage dependent tiers, i.e. supplemental information lines starting with %.
- Options:
  - keep_data (bool): retain dependent tiers.
  - action: control action-dependent tiers like %act.
- Default behaviors are specified for each option.
- Extensive references to the CHAT manual are included for detailed definitions.
- This config enables flexible, fine-grained control over CHAT transcription processing.
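The three groups of options correspond to the three kinds of lines in a CHAT file, which can be told apart by their first character. The sketch below illustrates that idea under stated assumptions: `classify_line` and `filter_lines` are hypothetical helpers, the `keep` dictionary is an illustrative stand-in rather than the actual YAML schema, and continuation lines are assumed to begin with a tab and to belong to the line they follow.

```python
def classify_line(line):
    """Classify a CHAT line by its first character."""
    if line.startswith("@"):
        return "header"
    if line.startswith("*"):
        return "utterance"
    if line.startswith("%"):
        return "dependent"
    if line.startswith("\t"):
        return "continuation"
    return "other"


def filter_lines(lines, keep):
    """Keep lines whose kind is enabled in `keep`; unknown kinds are kept."""
    kept = []
    current = "other"
    for line in lines:
        kind = classify_line(line)
        if kind != "continuation":
            current = kind  # a continuation inherits the kind of the line above it
        if keep.get(current, True):
            kept.append(line)
    return kept


lines = ["@Begin", "*CHI:\tmore cookie .", "%act:\treaches for jar", "\tslowly", "@End"]
print(filter_lines(lines, {"header": False, "utterance": True, "dependent": False}))
```

With headers and dependent tiers disabled, only the utterance line survives, including the removal of the continuation that belonged to the dropped %act tier.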
- --input_file (str): Path to the .cha file to be processed.
  Default: raw/childes/Eng-NA/Bates/Free20/amy.cha
- --output_file (str): Path where the processed .cha file will be saved.
  Default: prep/output.cha
- --config_path (str): Path to the configuration file (.yaml) defining preprocessing rules and settings.
  Default: configs/default.yaml
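For readers who want to see how these three arguments fit together, the following sketch declares an equivalent command-line interface with `argparse`. The argument names and defaults are taken from the list above; everything else, including the function name `build_parser` and the help strings, is an assumption and may differ from what prepare_childes.py actually does.

```python
import argparse


def build_parser():
    """Build a parser mirroring the three documented arguments (a sketch)."""
    parser = argparse.ArgumentParser(description="Preprocess a CHAT (.cha) file.")
    parser.add_argument(
        "--input_file",
        type=str,
        default="raw/childes/Eng-NA/Bates/Free20/amy.cha",
        help="Path to the .cha file to be processed.",
    )
    parser.add_argument(
        "--output_file",
        type=str,
        default="prep/output.cha",
        help="Path where the processed .cha file will be saved.",
    )
    parser.add_argument(
        "--config_path",
        type=str,
        default="configs/default.yaml",
        help="Path to the YAML configuration file.",
    )
    return parser


# Parsing an empty argument list yields the documented defaults.
args = build_parser().parse_args([])
print(args.input_file, args.output_file, args.config_path)
```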
To process a file named amy.cha located at raw/childes/Eng-NA/Bates/Free20/ using a custom configuration file my_config.yaml:
python prepare_childes.py --input_file raw/childes/Eng-NA/Bates/Free20/amy.cha --output_file prep/my_output.cha --config_path configs/my_config.yaml
The processed file will be saved to prep/my_output.cha.
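Since the script takes one input file per call, processing a whole collection means invoking it once per transcript. The sketch below shows one way to do that. It is not part of PyChildes: `build_commands` and `run_all` are hypothetical helpers, the output tree is assumed to mirror the input tree, and output directories are created up front because it is not documented whether the script creates them itself.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(raw_root, prep_root, config_path):
    """Build one prepare_childes.py command per .cha file under raw_root."""
    raw_root, prep_root = Path(raw_root), Path(prep_root)
    commands = []
    for src in sorted(raw_root.rglob("*.cha")):
        dst = prep_root / src.relative_to(raw_root)  # mirror the input layout
        commands.append([
            sys.executable, "prepare_childes.py",
            "--input_file", str(src),
            "--output_file", str(dst),
            "--config_path", str(config_path),
        ])
    return commands


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        out_path = Path(cmd[cmd.index("--output_file") + 1])
        out_path.parent.mkdir(parents=True, exist_ok=True)  # assumed to be necessary
        subprocess.run(cmd, check=True)


# Example (run from the directory that contains prepare_childes.py):
#     run_all(build_commands("raw/childes", "prep/childes", "configs/default.yaml"))
```

Separating command construction from execution makes it easy to print and inspect the commands before running anything.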