Exploratory data analysis and machine-learning pipeline for the IBM Transactions for Anti-Money Laundering (AML) synthetic dataset.
Dataset: https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml
# 1. Install uv (if not present)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Create virtual environment and install dependencies
uv sync
# 3. Download dataset (requires Kaggle API token at ~/.kaggle/kaggle.json)
uv run python -c "
import kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_files(
    'ealtman2019/ibm-transactions-for-anti-money-laundering-aml',
    path='data/raw', unzip=True
)
"
# 4. Launch Jupyter
uv run jupyter lab

The benchmark family includes multiple HI / LI and Small / Medium / Large splits, but this repo currently analyses only the HI-Large split.
- Notebook transaction input: data/raw/HI-Large_Trans.csv
- Notebook patterns input: data/raw/HI-Large_Patterns.txt
- Other splits and companion files are not used by notebooks/01_eda.ipynb
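
Before opening the notebook it can help to confirm that both inputs were downloaded. The following is a minimal sketch, assuming it is run from the repository root; the helper name `missing_inputs` is hypothetical and not part of this repo.

```python
from pathlib import Path

# Paths taken from the list above, relative to the repository root.
RAW_DIR = Path("data/raw")
EXPECTED_INPUTS = ("HI-Large_Trans.csv", "HI-Large_Patterns.txt")


def missing_inputs(raw_dir: Path = RAW_DIR) -> list[str]:
    """Return the expected input file names that are absent from raw_dir."""
    return [name for name in EXPECTED_INPUTS if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print(f"Missing from {RAW_DIR}: {', '.join(missing)}")
    else:
        print("All notebook inputs are present.")
```

An empty list means the download step placed both files where the notebook expects them.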
In the broader benchmark:
- HI = high-illicit-ratio split, with roughly 5% laundering-labelled transactions
- LI = low-illicit-ratio split, with roughly 0.1% laundering-labelled transactions
- Small / Medium / Large refer to dataset size, not a different schema
| Group | Size | Approx illicit ratio |
|---|---|---|
| HI | Small | ~5 % |
| HI | Medium | ~5 % |
| HI | Large | ~5 % |
| LI | Small | ~0.1 % |
| LI | Medium | ~0.1 % |
| LI | Large | ~0.1 % |
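
The approximate ratios above can be checked against any downloaded split. This is a rough sketch using only the standard library: it streams the file once and counts rows whose Is Laundering value is 1. The function name `illicit_ratio` is hypothetical, and the in-memory sample is made up for illustration.

```python
import csv
import io
from typing import TextIO


def illicit_ratio(handle: TextIO, label_column: str = "Is Laundering") -> float:
    """Fraction of rows whose label column equals 1, read in a single pass."""
    reader = csv.reader(handle)
    header = next(reader)
    label_idx = header.index(label_column)  # raises ValueError if the column is absent
    total = flagged = 0
    for row in reader:
        total += 1
        flagged += row[label_idx].strip() == "1"
    return flagged / total if total else 0.0


# Tiny in-memory sample standing in for a *_Trans.csv file (values are made up).
sample = io.StringIO(
    "Timestamp,Is Laundering\n"
    "t1,0\n"
    "t2,1\n"
    "t3,0\n"
    "t4,0\n"
)
print(f"{illicit_ratio(sample):.2%}")  # prints 25.00%
```

For a real file, pass an open handle, for example `open("data/raw/HI-Large_Trans.csv", newline="")`. Streaming row by row avoids loading the whole Large split into memory.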
Each split ships with three companion files:
- *_Trans.csv = transaction-level records
- *_accounts.csv = account metadata
- *_Patterns.txt = ground-truth laundering pattern blocks
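
The pattern blocks in a *_Patterns.txt file can be grouped with a small parser. The sketch below assumes each block is delimited by lines starting with `BEGIN LAUNDERING ATTEMPT - <TYPE>` and `END LAUNDERING ATTEMPT`, with transaction rows in between. That delimiter format is an assumption and should be checked against the downloaded file. The function name `parse_pattern_blocks` is hypothetical.

```python
import io
from typing import Iterable

# Assumed block delimiters; verify against the actual *_Patterns.txt file.
BEGIN = "BEGIN LAUNDERING ATTEMPT - "
END = "END LAUNDERING ATTEMPT"


def parse_pattern_blocks(lines: Iterable[str]) -> list[tuple[str, list[str]]]:
    """Group raw transaction lines by the pattern block that contains them."""
    blocks: list[tuple[str, list[str]]] = []
    current_type = None
    current_rows: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(BEGIN):
            # Keep only the pattern name; drop any trailing detail after a colon.
            current_type = line[len(BEGIN):].split(":", 1)[0].strip()
            current_rows = []
        elif line.startswith(END):
            if current_type is not None:
                blocks.append((current_type, current_rows))
            current_type = None
        elif current_type is not None:
            current_rows.append(line)
    return blocks


# Made-up sample in the assumed format.
sample = io.StringIO(
    "BEGIN LAUNDERING ATTEMPT - FAN-OUT\n"
    "row-a\n"
    "row-b\n"
    "END LAUNDERING ATTEMPT - FAN-OUT\n"
    "\n"
    "BEGIN LAUNDERING ATTEMPT - CYCLE: Max 3 hops\n"
    "row-c\n"
    "END LAUNDERING ATTEMPT - CYCLE\n"
)
print([(name, len(rows)) for name, rows in parse_pattern_blocks(sample)])
# prints [('FAN-OUT', 2), ('CYCLE', 1)]
```

Lines outside any block are ignored, and a block without a closing line is dropped rather than guessed at.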
| Column | Description |
|---|---|
| Timestamp | Date-time of the transaction |
| From Bank | Originating bank ID |
| Account (from) | Originating account ID |
| To Bank | Receiving bank ID |
| Account (to) | Receiving account ID |
| Amount Received | Amount received (in Receiving Currency) |
| Receiving Currency | ISO currency code at destination |
| Amount Paid | Amount paid (in Payment Currency) |
| Payment Currency | ISO currency code at source |
| Payment Format | Wire, Cheque, Credit Card, ACH, etc. |
| Is Laundering | Binary label – 1 = laundering, 0 = legitimate |
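
One way to work with this schema outside pandas is a typed record per row. The sketch below reads fields by position, assuming the CSV column order matches the table above. Position is used because the raw header may give both account columns the same name, which is an assumption to verify. The `Transaction` class and `parse_row` helper are hypothetical, and the sample row is made up.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    timestamp: str  # kept as text; no timestamp format is assumed here
    from_bank: str
    from_account: str
    to_bank: str
    to_account: str
    amount_received: Decimal
    receiving_currency: str
    amount_paid: Decimal
    payment_currency: str
    payment_format: str
    is_laundering: bool


def parse_row(row: list[str]) -> Transaction:
    """Build a Transaction from one CSV row, reading fields by position."""
    if len(row) != 11:
        raise ValueError(f"expected 11 fields, got {len(row)}")
    return Transaction(
        timestamp=row[0],
        from_bank=row[1],
        from_account=row[2],
        to_bank=row[3],
        to_account=row[4],
        amount_received=Decimal(row[5]),
        receiving_currency=row[6],
        amount_paid=Decimal(row[7]),
        payment_currency=row[8],
        payment_format=row[9],
        is_laundering=row[10].strip() == "1",
    )


# Made-up row in the column order of the table above.
sample = io.StringIO("2022/09/01 00:20,011,A100,012,B200,250.00,US Dollar,250.00,US Dollar,ACH,1\n")
tx = parse_row(next(csv.reader(sample)))
print(tx.amount_paid, tx.is_laundering)  # prints 250.00 True
```

Amounts are parsed as `Decimal` so that currency values are not subject to binary floating-point rounding.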
- Fan-out – One account rapidly sends to many recipients
- Fan-in – Many accounts consolidate to one
- Bipartite – Many-to-many transfer block between sender and receiver sets
- Cycle – Money circulates through a closed loop of accounts
- Gather-Scatter – Aggregate then disperse through layering
- Scatter-Gather – Disperse then re-aggregate
- Stack – Layered pass-through chains
- Random – Irregular mixing pattern
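
The fan-out and fan-in shapes can be approximated by counting distinct counterparties per account. This is a simplified sketch, not the dataset's own labelling logic: it ignores timing (so it does not capture "rapidly") and treats each account as an opaque key. With real data a (bank, account) tuple is probably the safer key. The function name `fan_degrees` is hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Iterable


def fan_degrees(
    edges: Iterable[tuple[Hashable, Hashable]],
) -> tuple[dict[Hashable, int], dict[Hashable, int]]:
    """Return (fan_out, fan_in): distinct-counterparty counts per account."""
    recipients = defaultdict(set)  # sender -> set of distinct receivers
    senders = defaultdict(set)     # receiver -> set of distinct senders
    for src, dst in edges:
        recipients[src].add(dst)
        senders[dst].add(src)
    fan_out = {account: len(seen) for account, seen in recipients.items()}
    fan_in = {account: len(seen) for account, seen in senders.items()}
    return fan_out, fan_in


# Made-up transfers: A pays three accounts (one of them twice); D is paid by three.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D"), ("C", "D"), ("A", "B")]
fan_out, fan_in = fan_degrees(edges)
print(fan_out["A"], fan_in["D"])  # prints 3 3
```

Repeated transfers between the same pair count once, so a high degree reflects many distinct counterparties, not many transactions.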
The structuring / smurfing view from the EDA is one of the clearest visuals in the notebook.
aml-transaction-detection/
├── data/
│ ├── raw/ <- original Kaggle CSVs
│ └── processed/ <- parquet caches
├── notebooks/
│ └── 01_eda.ipynb <- main EDA notebook
├── src/ <- helper modules (future)
├── pyproject.toml
└── README.md
