A consolidated framework for AI and ML components developed for the new portal.
To run the app locally:
- Go to the root folder of this project.
- Create a `.env` file (or simply copy the `.env.sample` file).
- Add the following (with your real keys):

  ```
  API_KEY="your_actual_api_key_here"
  OPENAI_API_KEY="your_actual_openai_api_key_here"
  PROFILE="your_actual_environment_here"
  ```

  If you plan to train models, also include:

  ```
  ES_ENDPOINT="your_actual_elasticsearch_endpoint"
  ES_API_KEY="your_actual_es_api_key"
  ```
- Run `./startServer.sh`. Then visit http://localhost:8000.
- Test app health at http://localhost:8000/api/v1/ml/health.
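For a scripted check, a minimal sketch using the `requests` library (assuming the server is running on the default port and the endpoint returns JSON):

```python
import requests

# Hit the health endpoint; HTTP 200 means the service is up.
response = requests.get("http://localhost:8000/api/v1/ml/health")
print(response.status_code, response.json())
```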
- Install Conda (if not already installed): follow the instructions at Conda Installation.
- Create the Conda virtual environment:

  ```bash
  conda env create -f environment.yml
  ```
Poetry is used for dependency management. The `pyproject.toml` file is the most important piece: it orchestrates the project and its dependencies.

You can update `pyproject.toml` to add or remove dependencies with:

```bash
poetry add <pypi-dependency-name>     # e.g. poetry add numpy
poetry remove <pypi-dependency-name>  # e.g. poetry remove numpy
```

After manually modifying `pyproject.toml`, update the `poetry.lock` file with the `poetry lock` command. To update all dependencies, use the `poetry update` command.
- Activate the Conda virtual environment:

  ```bash
  conda activate data-discovery-ai
  ```

- Install environment dependencies:

  ```bash
  # after cloning the repo with git clone
  cd data-discovery-ai
  poetry install
  ```
FastAPI runs internal checks before serving `/process_record` API calls. These checks include:

- ✅ Required model resource files must be present in `data_discovery_ai/resources/`.
- ✅ A valid `OPENAI_API_KEY` must be in `.env` unless you're in the `development` environment.
- ✅ If `PROFILE=development`, Ollama must be running locally at http://localhost:11434.
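The actual implementation lives in the app's startup logic; the following is only an illustrative sketch of checks like these, with all names (e.g. `check_startup_requirements`) hypothetical:

```python
import os
from pathlib import Path

import requests


def check_startup_requirements() -> None:
    """Hypothetical sketch of the pre-flight checks described above."""
    # Model resource files must be present.
    resources = Path("data_discovery_ai/resources")
    if not resources.is_dir():
        raise RuntimeError(f"Missing resource folder: {resources}")

    if os.getenv("PROFILE") == "development":
        # In development, Ollama must be reachable locally.
        requests.get("http://localhost:11434", timeout=5).raise_for_status()
    elif not os.getenv("OPENAI_API_KEY"):
        # Outside development, a valid OpenAI key is required.
        raise RuntimeError("OPENAI_API_KEY is not set in .env")
```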
To use the Llama3 model locally without OpenAI:

- Go to the Ollama download page and download the version that matches your operating system (Windows, Linux, or macOS).
- After installation, start Ollama either by launching the app or running the following command:

  ```bash
  ollama serve
  ```

- Pull the "llama3" model used for local development:

  ```bash
  ollama pull llama3
  ```

- (Optional) Consider installing Open WebUI to test Llama3 locally through a user-friendly interface:

  ```bash
  docker run -d --network=host -v open-webui:/app/backend/data -e PORT=8090 -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui ghcr.io/open-webui/open-webui:main
  ```

  Once the Open WebUI container is running, open your browser and go to: http://localhost:8090.
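To verify the model from Python without a UI, a minimal sketch against Ollama's local REST API (assuming Ollama is serving on its default port and `llama3` has been pulled):

```python
import requests

# Ask the locally pulled llama3 model for a short, non-streamed completion.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Say hello.", "stream": False},
    timeout=60,
)
print(response.json()["response"])
```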
Simply run:

```bash
poetry run uvicorn data_discovery_ai.server:app --reload --log-config=log_config.yaml
```
Run all tests with:

```bash
poetry run python -m unittest discover -s tests
```

Run manual checks:

```bash
pre-commit run --all-files
```

Checks are also executed when you run `git commit`. The configurations for pre-commit hooks are defined in `.pre-commit-config.yaml`.
We are using gitmoji (OPTIONAL) with husky and commitlint. Here is an example of the most used ones:

- 🎨 - Improving structure/format of the code.
- ⚡️ - Improving performance.
- 🔥 - Removing code or files.
- 🐛 - Fixing a bug.
- 🚑 - Critical hotfix.
- ✨ - Introducing new features.
- 📝 - Adding or updating documentation.
- 🚀 - Deploying stuff.
- 💄 - Updating the UI and style files.
- 🎉 - Beginning a project.

Example of use:

```
:wrench: add husky and commitlint config
```
Branch names use the following prefixes:

- `hotfix/`: for quickly fixing critical issues, usually with a temporary solution
- `bugfix/`: for fixing a bug
- `feature/`: for adding, removing or modifying a feature
- `test/`: for experimenting with something which is not an issue
- `wip/`: for a work in progress

And add the issue id after a `/`, followed by an explanation of the task.

Example of use:

```
feature/5348-create-react-app
```
Once the app is running, two routes are available:

| Route | Description |
|---|---|
| `GET /api/v1/ml/health` | Health check |
| `POST /api/v1/ml/process_record` | Single entry point for calling AI models to process a metadata record |

Example request body:

```json
{
  "selected_model": ["description_formatting"],
  "title": "test title",
  "abstract": "test abstract"
}
```

Required header:

```
X-API-Key: your_api_key
```

(Must match the value of `API_KEY` specified in the environment variables.)
AI Model Options

`selected_model`: the AI models provided by `data-discovery-ai`. It should be a list of strings naming the AI task agents. Currently, four AI task agents are available for distinct tasks:

- `keyword_classification`: predicts keywords from AODN vocabularies based on metadata `title` and `abstract`, using a pretrained ML model.
- `delivery_classification`: predicts data delivery mode based on metadata `title`, `abstract`, and `lineage`, using a pretrained ML model.
- `description_formatting`: reformats a long abstract into Markdown based on metadata `title` and `abstract`, using the LLM "gpt-4o-mini".
- `link_grouping`: categorises links into four groups (["Python Notebook", "Document", "Data Access", "Other"]) based on metadata `links`.
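Putting the route, body, and header together, a minimal client sketch (assuming the server runs locally on port 8000 and `your_api_key` matches the `API_KEY` in `.env`):

```python
import requests

# Call the process_record endpoint with one selected model.
response = requests.post(
    "http://localhost:8000/api/v1/ml/process_record",
    headers={"X-API-Key": "your_api_key"},
    json={
        "selected_model": ["description_formatting"],
        "title": "test title",
        "abstract": "test abstract",
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```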
Currently, two machine learning pipelines are available for training and evaluating models:

- `keyword`: keyword classification model, a Sequential model for a multi-label classification task
- `delivery`: data delivery classification model, a self-learning model for a binary classification task

To run one of the pipelines (for example, the keyword one), you can use the following command in your terminal:

```bash
python -m data_discovery_ai.ml.pipeline --pipeline keyword --start_from_preprocess False --model_name experimental
```

You can also use a shorter version:

```bash
python -m data_discovery_ai.ml.pipeline -p keyword -s False -n experimental
```

If the raw data has changed (e.g., updated, cleaned, or expanded), it is recommended to re-train the model using the latest data. To do this, set:

```
--start_from_preprocess True
```

As mentioned in Environment variables, the Elasticsearch endpoint and API key are required to be set up in the `.env` file.
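As an illustration of how those variables might be consumed, a minimal sketch using the official `elasticsearch` Python client (the variable names match `.env`; the client usage here is an assumption, not the project's actual data-loading code):

```python
import os

from elasticsearch import Elasticsearch

# Connect using the endpoint and API key configured in .env.
client = Elasticsearch(
    os.environ["ES_ENDPOINT"],
    api_key=os.environ["ES_API_KEY"],
)
print(client.info())  # basic connectivity check
```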
Running a pipeline trains a machine learning model and saves several resource files for reuse, so you don't have to reprocess data or retrain the model every time.

- `delivery` pipeline: outputs are saved in `data_discovery_ai/resources/DataDeliveryModeFilter/`

  | File Name | Description |
  |---|---|
  | `filter_preprocessed.pkl` | Preprocessed data used for training and testing |
  | `development.pkl` | Trained binary classification model |
  | `development.pca.pkl` | PCA model used for dimensionality reduction |

- `keyword` pipeline: outputs are saved in `data_discovery_ai/resources/KeywordClassifier/`

  | File Name | Description |
  |---|---|
  | `keyword_sample.pkl` | Preprocessed data used for training and testing |
  | `keyword_label.pkl` | Mapping between labels and internal IDs |
  | `development.keras` | Trained Keras model file (name set by `--model_name`) |
These files are generated and saved automatically during pipeline runs, and are intended for reuse in subsequent runs to avoid retraining.
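For example, a minimal sketch of inspecting one of the saved `.pkl` resources (assuming standard pickle serialization; the exact object types are project-specific):

```python
import pickle
from pathlib import Path

# Load the label mapping produced by the keyword pipeline.
path = Path("data_discovery_ai/resources/KeywordClassifier/keyword_label.pkl")
with path.open("rb") as f:
    labels = pickle.load(f)
print(type(labels))
```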
The `--model_name` argument helps organise different versions of your model. Here's how they're typically used:

| Name | Purpose | When to Use |
|---|---|---|
| `development` | Active model development | For testing and iterating on new ideas |
| `experimental` | Try new techniques or tuning | For exploring new features or architectures |
| `benchmark` | Compare against the current baseline model | When validating improvements over a previous version |
| `staging` | Pre-production readiness | When testing full integration before final deployment |
| `production` | Final production model | Live version used in production APIs or systems |

💡 Tip: When working locally, use `--model_name experimental` to avoid overwriting files used in deployments.
Each model name reflects a stage in the model lifecycle:

- Development
  - Initial model design and prototyping
  - Reaches minimum performance targets with stable training
- Experimental
  - Shows consistent performance improvements
  - Experiment logs and results are clearly documented
- Benchmark
  - Outperforms the existing benchmark (usually a copy of the production model)
  - Validated using selected evaluation metrics
- Staging
  - Successfully integrated with application components (e.g. APIs)
  - Ready for deployment, pending final checks
- Production
  - Deployed in a live environment
  - Monitored continuously, supports user feedback and live data updates
In the configuration file `data_discovery_ai/common/parameters.yaml`, you can specify which model version each task should use. For example:

```yaml
model:
  delivery_classification:
    pretrained_model: development
```

This means the agent handling the `delivery_classification` task will use the `development` version of the model.
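As an illustration (not the project's actual loader), a minimal sketch of reading this setting with PyYAML:

```python
import yaml

# Read the configured model version for the delivery_classification task.
with open("data_discovery_ai/common/parameters.yaml") as f:
    params = yaml.safe_load(f)
version = params["model"]["delivery_classification"]["pretrained_model"]
print(version)  # e.g. "development"
```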
We use MLflow to track model training and performance over time (hyperparameters, accuracy, precision, etc.).

To start the tracking server locally, run:

```bash
./start_mlflow.sh
```

Once it's running, you can open the tracking dashboard in your browser: http://127.0.0.1:53000

You can change the model's training settings (such as how long it trains or how fast it learns) by editing the trainer section in the file `data_discovery_ai/config/parameters.yaml`.
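For reference, a minimal sketch of how a training run might log to that server with the standard `mlflow` API (illustrative only; the experiment, parameter, and metric names here are made up):

```python
import mlflow

# Point the client at the local tracking server started above.
mlflow.set_tracking_uri("http://127.0.0.1:53000")
mlflow.set_experiment("keyword-classification")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)  # hypothetical hyperparameter
    mlflow.log_metric("accuracy", 0.93)      # hypothetical metric
```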
| File | Description |
|---|---|
| `data_discovery_ai/common/constants.py` | Shared constants |
| `data_discovery_ai/common/parameters.yaml` | Stores parameter settings for ML models and AI agents |
```
data_discovery_ai/
├── config/      # Common utilities and shared configurations/constants used across modules
├── core/        # Core logic of the application such as API routes
├── agents/      # Task-specific agent modules using ML/AI/rule-based tools
├── ml/          # Machine learning models: training, inference, evaluation logic
├── utils/       # Utility functions and helper scripts for various tasks
├── resources/   # Stored assets such as pretrained models, sample datasets, and other resources required for model inference
├── notebooks/   # Jupyter notebooks
├── tests/       # Unit tests for validating core components
│   ├── agents
│   ├── ml
│   └── utils
└── server.py    # FastAPI application entry point
```