- Define the get_instance_status and get_job_status() methods of the BenchmarkLauncher - Issue #570 by @R-Palazzo
- Define the terminate() method of the BenchmarkLauncher - Issue #568 by @R-Palazzo
- Define workflows to be able to run from a config file or some given parameters - Issue #547 by @R-Palazzo
- Add a script that launches a benchmark from a yaml file or a set of parameters - Issue #546 by @R-Palazzo
- Move the current benchmark configs to yaml files - Issue #545 by @R-Palazzo
- Update the "wins" computation to include the Pareto front - Issue #572 by @R-Palazzo
- The ResultExplorer should look for latest run results by default - Issue #552 by @R-Palazzo
- Update load_results to be able to filter on dataset or synthesizer - Issue #551 by @R-Palazzo
- `OUTPUT_DESTINATION_AWS` points to the wrong location - Issue #564 by @R-Palazzo
- Include SDV-Enterprise in single-table benchmarks - Issue #549 by @R-Palazzo
- Add extremity data points of the Pareto curve for the Quality–Speed Tradeoff plot - Issue #556 by @R-Palazzo
- Internal benchmark results upload crashes if there's no error column in the result table - Issue #544 by @R-Palazzo
- Update RELEASE guide to include conda-forge step - Issue #560 by @sarahmish
- Support Python 3.14 - Issue #528 by @pvk-developer
- Update license information in pyproject.toml to use new format - Issue #527 by @pvk-developer
- Set the SDGym Slack alert to be posted on the `sdgym` channel. - Issue #555 by @R-Palazzo
- Add a Dataset Details and a Model Details excel sheets when uploading benchmark results - Issue #532 by @R-Palazzo
- Add workflow to run SDGym multi-table benchmark monthly and publish results - Issue #516 by @R-Palazzo
- Define internal single and multi table methods to run on GCP - Issue #515 by @R-Palazzo
- Add multi table support to ResultsExplorer - Issue #488 by @fealho
- Add benchmark_multi_table_aws - Issue #487 by @R-Palazzo
- Add benchmark_multi_table function - Issue #486 by @pvk-developer
- Add multi-table UniformSynthesizer - Issue #485 by @R-Palazzo
- Private S3 bucket access fails in benchmark_multi_table_aws despite valid credentials - Issue #525 by @R-Palazzo
- RealTabFormer 0.2.4 causes integration to fail - Issue #523 by @R-Palazzo
- Remove deprecated parameters - Issue #519 by @fealho
- Update multi-table dataset list - Issue #535 by @R-Palazzo
- If there are no datasets in the bucket, the `DatasetExplorer` should show a warning and return an empty table - Issue #475 by @fealho
- Add input validation for the `DatasetExplorer` class and functions - Issue #474 by @fealho
- Record the train and sample times whenever an error occurs during a benchmark. - Issue #503 by @R-Palazzo
- Workflow fails due to lack of space - Issue #511 by @rwedge
- Rename create_sdv_synthesizer_variant to create_synthesizer_variant - Issue #491 by @R-Palazzo
- SDGym should be able to automatically discover SDV Enterprise synthesizers - Issue #481 by @R-Palazzo
- Incorporate the `get_available_datasets` functionality into the `DatasetExplorer` - Issue #473 by @fealho
- Update result aggregation logic in the ResultExplorer to match new naming schema - Issue #494 by @R-Palazzo
- When running a benchmark locally, the `additional_datasets_folder` path should be the root path - Issue #484 by @fealho
- Missing dependency openpyxl - Issue #479 by @rwedge
- Add a `DatasetExplorer` class that provides a summary of all datasets in a bucket (for a given modality) - Issue #469 by @pvk-developer
- Update SDGym to use the new S3 bucket and bucket structure - Issue #468 by @pvk-developer
- Update Pareto plot data generation to use the Adjusted Time and Quality score - Issue #462 by @R-Palazzo
- The `ResultsExplorer` should allow programmatic access to all the saved artifacts from benchmarking - Issue #450 by @R-Palazzo
- When performing multiple SDGym runs on the same day, save the artifacts with consistent naming - Issue #448 by @R-Palazzo
- To simulate graceful degradation, fallback to using the results from the UniformSynthesizer - Issue #439 by @rwedge
- Pip install sdgym released version on ec2 machines - Issue #437 by @pvk-developer
- Add a fallback to UniformSynthesizer when an error occurs and improve the time tracker of the synthetic data generation - Issue #436 by @R-Palazzo
- Make the synthesizer names consistent throughout SDGym - Issue #430 by @R-Palazzo
- Simplify the import API for SDGym's results explorer - Issue #429 by @R-Palazzo
- Add workflow to run SDGym monthly and publish results - Issue #425 by @R-Palazzo
- Add benchmark_single_table_aws function - Issue #414 by @R-Palazzo
- Add summarize function to SDGymResultsExplorer class - Issue #412 by @R-Palazzo
- Add SDGymResultsExplorer class - Issue #411 by @R-Palazzo
- Add ability to save synthesizers and data when running benchmark_single_table - Issue #410 by @R-Palazzo
- Update REalTabFormer default parameters so that it runs on benchmarking - Issue #400 by @fealho
- Add DCRBaseline Metric to single table report - Issue #397 by @gsheni
- Update link to s3 results in the Slack Alert message - Issue #464 by @R-Palazzo
- EC2 instance not terminating after timeout - Issue #463 by @R-Palazzo
- Adjusted time and quality score not aggregating correctly on EC2 - Issue #461 by @R-Palazzo
- Update warning message for deprecated parameters - Issue #455 by @R-Palazzo
- The `UniformSynthesizer` produces multiple `UserWarning` messages when run on a demo dataset - Issue #449 by @R-Palazzo
- Always include UniformSynthesizer doesn't work on AWS - Issue #446 by @R-Palazzo
- Fix minimum test version due to RealTabFormer and Torch releases - Issue #434 by @R-Palazzo
- Add modality parameter to get_available_datasets function - Issue #403 by @gsheni
- Update the EC2 instance used when run_on_ec2 is enabled - Issue #396 by @R-Palazzo
- All bump-version commands are failing - Issue #391 by @amontanez24
- To simulate graceful degradation, always run the UniformSynthesizer on all the requested datasets - Issue #438 by @rwedge
- Remove support for Python 3.8 - Issue #457 by @fealho
- Check pyproject for release candidate dependencies - Issue #406 by @rwedge
- Update the library installation script for EC2 machines to install optional dependencies like RealTabFormer - Issue #388 by @R-Palazzo
- Speed up test_benchmark_single_table_realtabformer_no_metrics integration test - Issue #379 by @fealho
- Update python set up step in workflows to use latest python version - Issue #361 by @frances-h
- Support Python 3.13 - Issue #355 by @rwedge
- Add workflow to release SDGym on PyPI - Issue #418 by @gsheni
- Add integration with 3rd party synthesizer (REalTabFormer) - Issue #347 by @cristid9
- Add support for numpy 2.0.0 - Issue #315 by @R-Palazzo
- Minimum tests failing because of broken action - Issue #351 by @amontanez24
- The `ColumnSynthesizer` should follow the sdtypes in the metadata (not the data's dtypes) - Issue #249 by @fealho
- Minimum tests fail due to dependency version mismatch - Issue #376 by @amontanez24
- Create Prepare Release workflow - Issue #364 by @R-Palazzo
- Migrate `SDV` synthesizer to Use Unified `Metadata` Instead of Legacy `SingleTableMetadata` - Issue #359 by @fealho
- Update codecov and add flag for integration tests - Issue #354 by @amontanez24
- `AttributeError` when running custom synthesizer with timeout - Issue #335 by @fealho
This release enables the diagnostic score to be computed in a benchmarking run. It also renames the IndependentSynthesizer to ColumnSynthesizer. Finally, it fixes a bug so that the time for all metrics will now be used to compute the Evaluate_Time column in the results.
- Cap numpy to less than 2.0.0 until SDGym supports it - Issue #313 by @gsheni
- The returned `Evaluate_Time` does not include results from all metrics - Issue #310 by @lajohn4747
- Rename `IndependentSynthesizer` to `ColumnSynthesizer` - Issue #319 by @lajohn4747
- Allow the ability to compute diagnostic score in a benchmarking run - Issue #311 by @lajohn4747
This release adds support for both Python 3.11 and 3.12! It also drops support for Python 3.7.
This release adds a new parameter to benchmark_single_table called run_on_ec2. When enabled, it will launch a t2.medium ec2 instance on the user's AWS account using the credentials they specify in environment variables. The benchmarking will then run on this instance. The output_filepath must be provided and must be in the format {s3_bucket_name}/{path_to_file} when run_on_ec2 is enabled.
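
As a rough illustration of the `output_filepath` requirement described above, the sketch below checks that a path has the `{s3_bucket_name}/{path_to_file}` shape before passing it to `benchmark_single_table` with `run_on_ec2=True`. The helper names `split_s3_output_filepath` and `run_benchmark_on_ec2` are hypothetical and not part of SDGym; only the `output_filepath` and `run_on_ec2` parameters come from the release note, and later releases may have changed or removed them.

```python
def split_s3_output_filepath(output_filepath):
    """Split '{s3_bucket_name}/{path_to_file}' into (bucket, key).

    Hypothetical helper for illustration only; it is not part of SDGym.
    """
    bucket, sep, key = output_filepath.partition('/')
    if not (bucket and sep and key):
        raise ValueError(
            "output_filepath must look like '{s3_bucket_name}/{path_to_file}'"
        )
    return bucket, key


def run_benchmark_on_ec2(output_filepath):
    """Run the benchmark on EC2 (needs sdgym installed and AWS credentials set).

    Hypothetical wrapper; assumes the parameters named in this release note.
    """
    # Fail early on a malformed path, before any instance is launched.
    split_s3_output_filepath(output_filepath)

    # Imported lazily so the path check above works without sdgym installed.
    import sdgym

    return sdgym.benchmark_single_table(
        output_filepath=output_filepath,
        run_on_ec2=True,
    )
```

Calling `split_s3_output_filepath('my-bucket/results/run.csv')` returns the bucket name and the path inside it, while a value with no `/` separator raises a `ValueError`.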
- Docs for AWS integration are incorrect - Issue #304 by @srinify
- Add support for Python 3.11 - Issue #250 by @fealho
- Remove anyio usage - Issue #252 by @lajohn4747
- Drop support for Python 3.7 - Issue #254 by @R-Palazzo
- Switch default branch from master to main - Issue #257 by @R-Palazzo
- Transition from using setup.py to pyproject.toml to specify project metadata - Issue #266 by @R-Palazzo
- Remove bumpversion and use bump-my-version - Issue #267 by @R-Palazzo
- Switch to using ruff for Python linting and code formatting - Issue #268 by @gsheni
- Add dependency checker - Issue #277 by @lajohn4747
- Add bandit workflow - Issue #282 by @R-Palazzo
- Cleanup automated PR workflows - Issue #286 by @R-Palazzo
- Add support for Python 3.12 - Issue #288 by @fealho
- Only run unit and integration tests on oldest and latest python versions for macos - Issue #294 by @R-Palazzo
- Bump versions of SDV, SDMetrics and RDT - Issue #298
- The `UniformSynthesizer` should follow the sdtypes in metadata (not the data's dtypes) - Issue #248 by @lajohn4747
- Fix minimum version workflow when pointing to github branch - Issue #280 by @R-Palazzo
- Passing synthesizer as string fails if run_on_ec2 is enabled - Issue #306 by @lajohn4747
- Add run_on_ec2 flag to benchmark_single_table - Issue #265 by @lajohn4747
- Remove FastML Synthesizer - Issue #292 by @lajohn4747
This release adds support for SDV 1.0 and PyTorch 2.0!
- Add functions to top level import - Issue #229 by @fealho
- Cleanup SDGym to the new SDV 1.0 metadata and synthesizers - Issue #212 by @fealho
- limit_dataset_size causes sdgym to crash - Issue #231 by @fealho
- benchmark_single_table crashes with metadata dict - Issue #232 by @fealho
- Passing None as synthesizers runs all of them - Issue #233 by @fealho
- timeout parameter causes sdgym to crash - Issue #234 by @pvk-developer
- SDGym is not working with latest torch - Issue #210 by @amontanez24
- Fix sdgym --help - Issue #206 by @katxiao
- Increase code style lint - Issue #123 by @fealho
- Remove code support for synthesizers that are not strings/classes - PR #236 by @fealho
- Code Refactoring - Issue #215 by @fealho
- Remove pomegranate - Issue #230 by @amontanez24
This release introduces methods for benchmarking single table data and creating custom synthesizers, which can be based on existing SDGym-defined synthesizers or on user-defined functions. This release also adds support for Python 3.10 and drops support for Python 3.6.
- Benchmarking progress bar should update on one line - Issue #204 by @katxiao
- Support local additional datasets folder with zip files - Issue #186 by @katxiao
- Enforce that each synthesizer is unique in benchmark_single_table - Issue #190 by @katxiao
- Simplify the file names inside the detailed_results_folder - Issue #191 by @katxiao
- Use SDMetrics silent report generation - Issue #179 by @katxiao
- Remove arguments in get_available_datasets - Issue #197 by @katxiao
- Accept metadata.json as valid metadata file - Issue #194 by @katxiao
- Check if file or folder exists before writing benchmarking results - Issue #196 by @katxiao
- Rename benchmarking argument "evaluate_quality" to "compute_quality_score" - Issue #195 by @katxiao
- Add option to disable sdmetrics in benchmarking - Issue #182 by @katxiao
- Prefix remote bucket with 's3' - Issue #183 by @katxiao
- Benchmarking error handling - Issue #177 by @katxiao
- Allow users to specify custom synthesizers' display names - Issue #174 by @katxiao
- Update benchmarking results columns - Issue #172 by @katxiao
- Allow custom datasets - Issue #166 by @katxiao
- Use new datasets s3 bucket - Issue #161 by @katxiao
- Create benchmark_single_table method - Issue #151 by @katxiao
- Update summary metrics - Issue #134 by @katxiao
- Benchmark individual methods - Issue #159 by @katxiao
- Add method to create a sdv variant synthesizer - Issue #152 by @katxiao
- Add method to generate a multi table synthesizer - Issue #149 by @katxiao
- Add method to create single table synthesizers - Issue #148 by @katxiao
- Updating existing synthesizers to new API - Issue #154 by @katxiao
- Pip encounters dependency issues with ipython - Issue #187 by @katxiao
- IndependentSynthesizer is printing out ConvergeWarning too many times - Issue #192 by @katxiao
- Size values in benchmarking results seems inaccurate - Issue #184 by @katxiao
- Import error in the example for benchmarking the synthesizers - Issue #139 by @katxiao
- Updates and bugfixes - Issue #132 by @csala
- Update README - Issue #203 by @katxiao
- Support Python Versions >=3.7 and <3.11 - Issue #170 by @katxiao
- SDGym Package Maintenance Updates documentation - Issue #163 by @katxiao
- Remove YData - Issue #168 by @katxiao
- Update to newest SDV - Issue #157 by @katxiao
- Update slack invite link. - Issue #144 by @pvk-developer
- updating workflows to work with windows - Issue #136 by @amontanez24
- Update conda dependencies - Issue #130 by @katxiao
This release adds support for Python 3.9, and updates dependencies to accept the latest versions when possible.
- Add support for Python 3.9 - Issue #127 by @katxiao
- Add pip check workflow - Issue #124 by @pvk-developer
- Fix meta.yaml dependencies - PR #119 by @fealho
- Upgrade dependency ranges - Issue #118 by @katxiao
This release fixed a bug where passing a json file as configuration for a multi-table synthesizer crashed the model.
It also adds a number of fixes and enhancements, including: (1) a function and CLI command to list the available synthesizer names,
(2) a curated set of dependencies, with Gretel made an optional dependency, (3) updating Gretel to use temp directories,
(4) using nvidia-smi to get the number of gpus and (5) multiple dockerfile updates to improve functionality.
- Bug when using JSON configuration for multiple multi-table evaluation - Issue #115 by @pvk-developer
- Use nvidia-smi to get number of gpus - PR #113 by @katxiao
- List synthesizer names - Issue #82 by @fealho
- Use nvidia base for dockerfile - PR #108 by @katxiao
- Add Makefile target to install gretel and ydata - PR #107 by @katxiao
- Curate dependencies and make Gretel optional - PR #106 by @csala
- Update gretel checkpoints to use temp directory - PR #105 by @katxiao
- Initialize variable before reference - PR #104 by @katxiao
This release adds new synthesizers for Gretel and ydata, and creates a Docker image for SDGym. It also includes enhancements to the accepted SDGym arguments, adds a summary command to aggregate metrics, and adds the normalized score to the benchmark results.
- Add normalized score to benchmark results - Issue #102 by @katxiao
- Add max rows and max columns args - Issue #96 by @katxiao
- Automatically detect number of workers - Issue #97 by @katxiao
- Add summary function and command - Issue #92 by @amontanez24
- Allow jobs list/JSON to be passed - Issue #93 by @fealho
- Add ydata to sdgym - Issue #90 by @fealho
- Add dockerfile for sdgym - Issue #88 by @katxiao
- Add Gretel to SDGym synthesizer - Issue #87 by @amontanez24
This release adds new features to store results and cache contents into an S3 bucket as well as a script to collect results from a cache dir and compile a single results CSV file.
- Collect cached results from s3 bucket - Issue #85 by @katxiao
- Store cache contents into an S3 bucket - Issue #81 by @katxiao
- Store SDGym results into an S3 bucket - Issue #80 by @katxiao
- Add a way to collect cached results - Issue #79 by @katxiao
- Allow reading datasets from private s3 bucket - Issue #74 by @katxiao
- Typos in the sdgym.run function docstring documentation - Issue #69 by @sbrugman
Major rework of the SDGym functionality to support a collection of new features:
- Add relational and timeseries model benchmarking.
- Use SDMetrics for model scoring.
- Update datasets format to match SDV metadata based storage format.
- Centralize default datasets collection in the `sdv-datasets` S3 bucket.
- Add options to download and use datasets from different S3 buckets.
- Rename synthesizers to baselines and adapt to the new metadata format.
- Add model execution and metric computation time logging.
- Add optional synthetic data and error traceback caching.
This version adds a rework of the benchmark function and a few new synthesizers.
- New CLI with `run`, `make-leaderboard` and `make-summary` commands
- Parallel execution via Dask or Multiprocessing
- Download datasets without executing the benchmark
- Support for python from 3.6 to 3.8
- `sdv.tabular.CTGAN`
- `sdv.tabular.CopulaGAN`
- `sdv.tabular.GaussianCopulaOneHot`
- `sdv.tabular.GaussianCopulaCategorical`
- `sdv.tabular.GaussianCopulaCategoricalFuzzy`
New updated leaderboard and minor improvements.
- Add parameters for PrivBNSynthesizer - Issue #37 by @csala
New Benchmark API and lots of improved documentation.
- The benchmark function now returns a complete leaderboard instead of only one score
- Class Synthesizers can be directly passed to the benchmark function
- One hot encoding errors in the Independent, VEEGAN and Medgan Synthesizers.
- Proper usage of the `eval` mode during sampling.
- Fix improperly configured datasets.
First release to PyPI