- Overview
- Example Use Cases
- Notable Requirements
- Features
- Diagrams
- Installation
- Usage
- Contributing
- License
This tool converts individual patient records into structured time-interval feature vectors, making them suitable for filtering, aggregation, and assembly into a data matrix D for binary classification machine learning tasks.
Compute summary statistics (e.g., the mean of n variables) for each unique patient, resulting in one row per patient. This is ideal for models requiring a single representation per individual.
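As a rough illustration of the patient-level aggregation use case, the sketch below collapses long-format records into one summary per patient using only the standard library. The record layout, the variable names, and the `summarise_by_patient` helper are hypothetical and are not part of the pat2vec API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format records: (patient_id, variable, value).
records = [
    ("p1", "hba1c", 6.0),
    ("p1", "hba1c", 8.0),
    ("p1", "bmi", 30.0),
    ("p2", "hba1c", 5.0),
]


def summarise_by_patient(rows):
    """Return {patient_id: {"<variable>_mean": value}}, one entry per patient."""
    grouped = defaultdict(lambda: defaultdict(list))
    for patient_id, variable, value in rows:
        grouped[patient_id][variable].append(value)
    return {
        patient_id: {f"{var}_mean": mean(vals) for var, vals in variables.items()}
        for patient_id, variables in grouped.items()
    }


patient_rows = summarise_by_patient(records)
print(patient_rows)
```

Each key of `patient_rows` corresponds to one row of the resulting data matrix; variables that a patient never had recorded are simply absent and would need an explicit missing-value policy.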
Generate a monthly time series for each patient that includes:
- Biochemistry results
- Demographic attributes
- MedCAT-derived clinical text annotations
The time series spans up to 25 years retrospectively, aligned to each patient's diagnosis date, enabling a consistent retrospective view across varying start times.
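One way to picture the alignment to the diagnosis date is to index each observation by the number of calendar months it precedes the diagnosis. The sketch below is an assumption about how such binning could work, not pat2vec's actual interval logic; `months_before` and `bin_observations` are hypothetical helpers.

```python
from datetime import date

LOOKBACK_YEARS = 25  # maximum retrospective span described above


def months_before(diagnosis: date, observed: date) -> int:
    """Calendar months from the observation to the diagnosis (0 = same month)."""
    return (diagnosis.year - observed.year) * 12 + (diagnosis.month - observed.month)


def bin_observations(diagnosis, observations):
    """Group (date, value) pairs into monthly bins keyed by months before diagnosis."""
    bins = {}
    for observed, value in observations:
        offset = months_before(diagnosis, observed)
        # Keep only observations inside the retrospective window.
        if 0 <= offset < LOOKBACK_YEARS * 12:
            bins.setdefault(offset, []).append(value)
    return bins


diagnosis_date = date(2020, 6, 15)
obs = [
    (date(2020, 6, 1), 5.1),   # same month as diagnosis
    (date(2020, 3, 20), 4.8),  # three months earlier
    (date(1990, 1, 1), 9.9),   # more than 25 years earlier, dropped
    (date(2020, 7, 1), 6.0),   # after diagnosis, dropped
]
bins = bin_observations(diagnosis_date, obs)
print(bins)
```

Because every patient is indexed relative to their own diagnosis date, bin 0 means the same thing for all patients regardless of when their records begin.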
- CogStack (cogstack_v8_lite) (cogstack_search_methods)
- Elasticsearch
- MedCAT: https://github.com/CogStack/MedCAT
- Python >=3.10
- Python3.10-venv (for install_pat2vec.py)
See requirements.txt
- Single patient
- Batch patient
- Cohort search and creation
- Automated random controls
- Modular feature space selection
- Look back
- Look forward
- Individual patient time windows
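The look back, look forward, and individual time window features can be thought of as computing a per-patient date range around an anchor date. The following is a minimal sketch under that assumption; `patient_window` and its parameters are hypothetical and do not reflect pat2vec's configuration options.

```python
from datetime import date, timedelta


def patient_window(anchor: date, days: int, lookback: bool = True):
    """Return (start, end) for a window of `days` ending or starting at `anchor`."""
    span = timedelta(days=days)
    return (anchor - span, anchor) if lookback else (anchor, anchor + span)


# Hypothetical per-patient anchor dates (for example, each patient's diagnosis date).
anchors = {"p1": date(2020, 1, 31), "p2": date(2018, 7, 1)}

# Look back 30 days from each patient's own anchor.
windows = {pid: patient_window(d, days=30, lookback=True) for pid, d in anchors.items()}

# Look forward 10 days from a single anchor.
forward = patient_window(date(2020, 1, 1), days=10, lookback=False)
print(windows, forward)
```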
This project includes a collection of diagrams illustrating the system architecture, data pipelines, ingestion examples, and method workflows.
You can view the Mermaid definitions or the rendered diagrams below.
Diagram | Mermaid | Image |
---|---|---|
System Architecture | assets/system_architecture.mmd | |
Configuration | assets/config.mmd | |

Diagram | Mermaid | Image |
---|---|---|
Data Pipeline | assets/data_pipeline.mmd | |
Main Batch Processing | assets/main_batch.mmd | |
Example Ingestion | assets/example_ingestion.mmd | |

Diagram | Mermaid | Image |
---|---|---|
Methods Annotation | assets/methods_annotation.mmd | |
Post-Processing Build Methods | assets/post_processing_build_methods.mmd | |

Diagram | Mermaid | Image |
---|---|---|
Ethnicity Abstractor | assets/ethnicity_abstractor.mmd | |
Get BMI | assets/get_bmi.mmd | |
Get Demographics | assets/get_demographics.mmd | |
Get Diagnostics | assets/get_diagnostics.mmd | |
Get Drugs | assets/get_drugs.mmd | |
Get Smoking | assets/get_smoking.mmd | |
Get News | assets/get_news.mmd | |
Get Dummy Data Cohort Searcher | assets/get_dummy_data_cohort_searcher.mmd | |
Get Method Bloods | assets/get_method_bloods.mmd | |
Get Method Patient Annotations | assets/get_method_pat_annotations.mmd | |
Get Treatment Docs (No Terms Fuzzy) | assets/get_treatment_docs_by_iterative_multi_term_cohort_searcher_no_terms_fuzzy.mmd | |
- Change to the `gloabl_files` directory and clone the repository:
  `git clone https://github.com/SamoraHunter/pat2vec.git`
  `cd pat2vec`
- Run the installation script:
  `install.bat`
- Add the `pat2vec` directory to the Python path. Before importing `pat2vec` in your Python script, add the following lines to the script, replacing `/path/to/pat2vec` with the actual path to the `pat2vec` directory inside your project:
  `import sys`
  `sys.path.append('/path/to/pat2vec')`
- Import `pat2vec` in your Python script:
  `import pat2vec`
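The path and import steps above can be wrapped in a small helper so that repeated runs (for example, re-executing a notebook cell) do not add the same entry twice. This is only a sketch: `add_to_python_path` is a hypothetical helper, and `/path/to/pat2vec` remains a placeholder for your own checkout.

```python
import sys
from pathlib import Path


def add_to_python_path(directory: str) -> str:
    """Append `directory` to sys.path once and return the resolved path."""
    resolved = str(Path(directory).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Replace the placeholder with the actual location of the pat2vec directory.
pat2vec_dir = add_to_python_path("/path/to/pat2vec")

# import pat2vec  # uncomment once the path above points at a real checkout
```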
This option installs `pat2vec` along with its dependencies, including:
- `pat2vec_env` (virtual environment)
- `snomed_methods`
- `cogstack_search_methods`
- `clinical_note_splitter`
Before running the installation, ensure you:
- Place the model pack in the appropriate directory: `gloabl_files/medcat_models/%modelpack%.zip`
- Populate the credentials file at `gloabl_files/credentials.py`
- (Optional) Add a SNOMED file if needed: gloabl_files/.. 'snomed', 'SnomedCT_InternationalRF2_PRODUCTION_20231101T120000Z', 'SnomedCT_InternationalRF2_PRODUCTION_20231101T120000Z', 'Full', 'Terminology', 'sct2_StatedRelationship_Full_INT_20231101.txt'
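A quick pre-flight check for the two required items in the list above might look like the following sketch. `check_prerequisites` is a hypothetical helper, not something shipped with pat2vec, and it assumes the model pack is any `.zip` file inside `medcat_models`.

```python
from pathlib import Path


def check_prerequisites(global_files: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    models_dir = global_files / "medcat_models"
    # Assumption: the model pack is any .zip archive in medcat_models/.
    if not any(models_dir.glob("*.zip")):
        problems.append(f"no model pack (*.zip) found in {models_dir}")
    credentials = global_files / "credentials.py"
    if not credentials.is_file():
        problems.append(f"missing {credentials}")
    return problems


for problem in check_prerequisites(Path("gloabl_files")):
    print("WARNING:", problem)
```

The optional SNOMED file is deliberately not checked here, since it is only needed for some workflows.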
- Copy the `install_pat2vec.sh` file to your installation directory.
- Grant execution permissions:
  `chmod +x install_pat2vec.sh`
- Run the installation using one of the following options:
  - Standard installation:
    `./install_pat2vec.sh`
  - Installation with proxy mirror support:
    `./install_pat2vec.sh --proxy`
  - Install to a specific directory:
    `./install_pat2vec.sh --directory /path/to/install`
  - Skip cloning repositories (if already cloned manually):
    `./install_pat2vec.sh --no-clone`
The script will clone the following repositories:
- Clone the repository:
  `git clone https://github.com/SamoraHunter/pat2vec.git`
  `cd pat2vec`
- Run the installation script (requires python3 on the path and venv):
  `chmod +x install.sh`
  `./install.sh`
- Add the `pat2vec` directory to the Python path. Before importing `pat2vec` in your Python script, add the following lines to the script, replacing `/path/to/pat2vec` with the actual path to the `pat2vec` directory inside your project:
  `import sys`
  `sys.path.append('/path/to/pat2vec')`
- Import `pat2vec` in your Python script:
  `import pat2vec`
- Set paths: `gloabl_files/medcat_models/modelpack.zip`, `gloabl_files/snomed_methods`, `gloabl_files/..`
- gloabl_files/
  - medcat_models/
    - modelpack.zip
  - snomed_methods/snomed_methods_v1.py**
  - pat2vec/
  - pat2vec_projects/
    - project_01/
      - example_usage.ipynb
      - treatment_docs.csv*

  *treatment_docs.csv should contain a column 'client_idcode' with your UUIDs.
  **https://github.com/SamoraHunter/SNOMED_methods.git
- Configure options.
- Run all.
- Examine example_usage.ipynb for additional functionality and use cases.
- Open example_usage.ipynb and hit run all.
- If testing in a live environment, ensure the testing flag is set to False in the config_obj.
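Since `treatment_docs.csv` must contain a `client_idcode` column, it can be worth validating the file before running the notebook. The sketch below uses only the standard `csv` module; `read_patient_ids` is a hypothetical helper and is not part of pat2vec.

```python
import csv

REQUIRED_COLUMN = "client_idcode"


def read_patient_ids(csv_path: str) -> list[str]:
    """Return the non-empty client_idcode values, or raise if the column is absent."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if REQUIRED_COLUMN not in (reader.fieldnames or []):
            raise ValueError(f"{csv_path} has no '{REQUIRED_COLUMN}' column")
        # Skip rows where the identifier is blank or missing.
        return [row[REQUIRED_COLUMN] for row in reader if row[REQUIRED_COLUMN]]


# Example usage (the path is relative to your own project directory):
# patient_ids = read_patient_ids("treatment_docs.csv")
```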
Contributions are welcome! Please see the contributing guidelines for more information.
This project is licensed under the MIT License. See the LICENSE file for details.