data-postdiction-pipeline

This repository contains the code for the Data Postdiction project, which uses machine learning to replace values within a column.

Dependencies

The Python components of this project were implemented using Python 3.12. The specific versions of most dependencies are listed in requirements.txt and can be installed by running pip install -r requirements.txt. To run the pipeline, run python pipeline.py <config-file.yaml>, where the config file matches the configs under configuration_files/ and the dataset is in the location specified by the database parameter.
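The run command above can also be wrapped in a small launcher script. The following is a minimal sketch, not part of the repository: the helper names build_pipeline_command and run_pipeline and the file name configuration_files/example.yaml are hypothetical, and it assumes only that pipeline.py takes the config file path as its single argument, as described above.

```python
import subprocess
import sys
from pathlib import Path


def build_pipeline_command(config_file, python=sys.executable):
    """Return the argument list for: python pipeline.py <config-file.yaml>."""
    return [python, "pipeline.py", str(config_file)]


def run_pipeline(config_file, repo_root="."):
    """Run pipeline.py from the repository root (hypothetical helper)."""
    root = Path(repo_root)
    config_path = Path(config_file)
    # Resolve relative config paths against the repository root.
    if not config_path.is_absolute():
        config_path = root / config_path
    # Fail early with a clear message instead of letting the pipeline start.
    if not config_path.is_file():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    # check=True raises CalledProcessError if the pipeline exits with an error.
    return subprocess.run(build_pipeline_command(config_file), cwd=root, check=True)


# Example call (the config file name is a placeholder, not a file in the repo):
# run_pipeline("configuration_files/example.yaml")
```

Checking for the config file before launching keeps a mistyped path from surfacing as an error deep inside the pipeline.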

For AFD (approximate functional dependency) detection using the Pyro algorithm, the Pyro-distro-1.0-SNAPSHOT-distro and metanome-cli-1.1.0 JARs are needed to build the setup. OpenJDK build 11.0.28+6 was used to build and test.

Environment

All experiments were conducted on physical hardware using a ThinkPad P14s Gen 5 equipped with an Intel Core Ultra 7 155H CPU with 22 logical processors and 64 GB of RAM. Scalability experiments were also conducted on a ThinkPad T470 equipped with an Intel Core i5-7300U CPU with 4 logical processors and 16 GB of RAM.

Citations

Anna Baskin, Scott Heyman, Brian T. Nixon, Constantinos Costa, and Panos K. Chrysanthis, "Remembering the Forgotten: Clustering, Outlier Detection, and Accuracy Tuning in a Postdiction Pipeline," in European Conference on Advances in Databases and Information Systems (ADBIS), 2023, pp. 46-55.
