Releases: sdv-dev/SDV
v1.0.1 - 2023-04-20
This release improves the `load_custom_constraint_classes` method by removing the `table_name` parameter and loading the constraint for all tables instead. It also improves some error messages and removes some of the warnings that had been surfacing. Support for sdtypes is enhanced by resolving a bug that was incorrectly specifying Faker functions for some of them. Support for datetime formats has also been improved. Finally, the `path` argument in some `save` and `load` methods was renamed to `filepath` for consistency.
New Features
- Method load_custom_constraint_classes does not need table_name parameter - Issue #1354 by @R-Palazzo
- Improve error message for invalid primary keys - Issue #1341 by @R-Palazzo
- Add functionality to find version add-on - Issue #1309 by @frances-h
Bugs Fixed
- Certain sdtypes cause Faker to raise error - Issue #1346 by @frances-h
- Change path to filepath for load and save methods - Issue #1352 by @fealho
- Some datetime formats cause InvalidDataError, even if the datetime matches the format - Issue #1136 by @amontanez24
Internal
- Inequality constraint raises RuntimeWarning (invalid value encountered in log) - Issue #1275 by @frances-h
- Pandas FutureWarning: Default dtype for Empty Series will be 'object' - Issue #1355 by @R-Palazzo
- Pandas FutureWarning: Length 1 tuple will be returned - Issue #1356 by @R-Palazzo
v1.0.0 - 2023-03-28
This is a major release that introduces a new API for SDV, aimed at streamlining the process of synthetic data generation! To achieve this, this release includes several large features.
Metadata
Some of the most notable additions are the new `SingleTableMetadata` and `MultiTableMetadata` classes. These classes enable a number of features that make it easier to synthesize your data correctly, such as:
- Automatic data detection - Calling `metadata.detect_from_dataframe()` or `metadata.detect_from_csv()` will populate the metadata autonomously with values it thinks represent the data.
- Easy updating - Once an instance of the metadata is created, values can be easily updated using a number of methods defined in the API. For more info, view the docs.
- Metadata validation - Calling `metadata.validate()` will return a report of any invalid definitions in the metadata specification.
- Upgrading - Users with the previous metadata format can easily upgrade to the new specification using the `upgrade_metadata()` method.
- Saving and loading - The metadata itself can easily be saved to a JSON file and loaded back up later.
Class and Module Names
Another major change is the renaming of our core modeling classes and modules. The name changes are meant to highlight the difference between the underlying machine learning models, and the objects responsible for the end-to-end workflow of generating synthetic data. The main name changes are as follows:
- `tabular` -> `single_table`
- `relational` -> `multi_table`
- `timeseries` -> `sequential`
- `BaseTabularModel` -> `BaseSingleTableSynthesizer`
- `GaussianCopula` -> `GaussianCopulaSynthesizer`
- `CTGAN` -> `CTGANSynthesizer`
- `TVAE` -> `TVAESynthesizer`
- `CopulaGan` -> `CopulaGANSynthesizer`
- `PAR` -> `PARSynthesizer`
- `HMA1` -> `HMASynthesizer`
In SDV 1.0, synthesizers are classes that take in metadata and handle data preprocessing, model training and model sampling. This is similar to the previous `BaseTabularModel` in SDV <1.0.
Synthetic Data Workflow
Synthesizers in SDV 1.0 define a clear workflow for generating synthetic data.
- Synthesizers are initialized with a metadata class.
- They can then be used to transform the data and apply constraints using the `synthesizer.preprocess()` method. This step also validates that the data matches the provided metadata to avoid errors in fitting or sampling.
- The processed data can then be fed into the underlying machine learning model using `synthesizer.fit_processed_data()`. (Alternatively, data can be preprocessed and fit to the model using `synthesizer.fit()`.)
- Data can then be sampled using `synthesizer.sample()`.
Each synthesizer class also provides a series of methods to help users customize the transformations their data goes through. Read more about that here.
Notice that the preprocessing and model fitting steps can now be separated. This can be helpful if preprocessing is time consuming or if the data has been processed externally.
Other Highly Requested Features
Another major addition is control over randomization. In SDV <1.0, users could set a seed to control the randomization for only some columns. In SDV 1.0, randomization is controlled for all columns. Every new call to sample generates new data, but the synthesizer's seed can be reset to the original state using `synthesizer.reset_randomization()`, enabling reproducibility.
SDV 1.0 adds accessibility and transparency into the transformers used for preprocessing and the underlying machine learning models.
- Using the `synthesizer.get_transformers()` method, you can access the transformers used to preprocess each column and view their properties. This can be useful for debugging and accessing privacy information like mappings used to mask data.
- Distribution parameters learned by copula models can be accessed using the `synthesizer.get_learned_distributions()` method.
PII handling is improved by the following features:
- Primary keys can be set to natural sdtypes (e.g. SSN, email, name). Previously they could only be numerical or text.
- The `PseudoAnonymizedFaker` can be used to provide consistent mapping to PII columns. As mentioned before, the mapping itself can be accessed by viewing the transformers for the column using `synthesizer.get_transformers()`.
- A bug causing PII columns to slow down modeling is patched.
Finally, the synthetic data can now be easily evaluated using the `evaluate_quality()` and `run_diagnostic()` methods. The data can be compared visually to the actual data using the `get_column_plot()` and `get_column_pair_plot()` methods. For more info on how to visualize or interpret the synthetic data evaluation, read the docs here.
Issues Resolved
New Features
- Change auto_assign_transformers to handle id types - Issue #1325 by @pvk-developer
- Change 'text' sdtype to 'id' - Issue #1324 by @frances-h
- In `upgrade_metadata`, return the object instead of writing it to a JSON file - Issue #1319 by @frances-h
- In `upgrade_metadata`, index primary keys should be converted to `text` - Issue #1318 by @amontanez24
- Add `load_from_dict` to SingleTableMetadata and MultiTableMetadata - Issue #1314 by @amontanez24
- Throw a `SynthesizerInputError` if `FixedCombinations` constraint is applied to a column that is not `boolean` or `categorical` - Issue #1306 by @frances-h
- Missing `save` and `load` methods for `HMASynthesizer` - Issue #1262 by @amontanez24
- Better input validation when creating single and multi table synthesizers - Issue #1242 by @fealho
- Better input validation on `HMASynthesizer.sample` - Issue #1241 by @R-Palazzo
- Validate that relationship must be between a `primary key` and `foreign key` - Issue #1236 by @fealho
- Improve `update_column` validation for `pii` attribute - Issue #1226 by @pvk-developer
- Order the output of `get_transformers()` based on the metadata - Issue #1222 by @pvk-developer
- Log if any `numerical_distributions` will not be applied - Issue #1212 by @fealho
- Improve error handling for `GaussianCopulaSynthesizer`: `numerical_distributions` - Issue #1211 by @fealho
- Improve error handling when validating `constraints` - Issue #1210 by @fealho
- Add `fake_companies` demo - Issue #1209 by @amontanez24
- Allow me to create a custom constraint class and use it in the same file - Issue #1205 by @amontanez24
- Sampling should reset after retraining the model - Issue #1201 by @pvk-developer
- Change function name `HMASynthesizer.update_table_parameters` --> `set_table_parameters` - Issue #1200 by @pvk-developer
- Add `get_info` method to synthesizers - Issue #1199 by @fealho
- Add evaluation methods to synthesizer - Issue #1190 by @fealho
- Update `evaluate.py` to work with the new `metadata` - Issue #1186 by @fealho
- Remove old code - Issue #1181 by @pvk-developer
- Drop support for python 3.6 and add support for 3.10 - Issue #1176 by @fealho
- Add constraint methods to MultiTableSynthesizers - Issue #1171 by @fealho
- Update custom constraint workflow - Issue #1169 by @pvk-developer
- Add get_constraints method to synthesizers - Issue #1168 by @pvk-developer
- Migrate adding and validating constraints to BaseSynthesizer - Issue #1163 by @pvk-developer
- Change metadata `"SCHEMA_VERSION"` --> `"METADATA_SPEC_VERSION"` - Issue #1139 by @amontanez24
- Add ability to reset random sampling - Issue #1130 by @pvk-developer
- Add get_available_demos - Issue #1129 by @fealho
- Add demo loading functionality - Issue #1128 by @fealho
- Use logging instead of printing in detect methods - Issue #1107 by @fealho
- Add save and load methods to synthesizers - Issue #1106 by @pvk-developer
- Add sampling methods to PARSynthesizer - Issue #1083 by @amontanez24
- Add transformer methods to PARSynthesizer - Issue #1082 by @fealho
- Add validate to PARSynthesizer - Issue #1081 by @amontanez24
- Add preprocess and fit methods to PARSynthesizer - I...
v0.18.0 - 2023-01-24
This release adds support for Python 3.10 and drops support for Python 3.6.
Maintenance
- Drop support for python 3.6 - Issue #1177 by @amontanez24
- Support for python 3.10 - Issue #939 by @amontanez24
- Support Python >=3.10,<4 - Issue #1000 by @amontanez24
v0.17.2 - 2022-12-08
This release fixes a bug in the demo module related to loading the demo data with constraints. It also adds a name to the demo datasets. Finally, it bumps the version of `SDMetrics` used.
Maintenance
New Features
- Provide a name for the default demo datasets - Issue #1124 by @amontanez24
Bugs Fixed
- Cannot load_tabular_demo with metadata - Issue #1123 by @amontanez24
v0.17.1 - 2022-09-29
v0.17.0 - 2022-09-09
This release updates the code to use RDT version 1.2.0 and greater, so that those new features are now available in SDV. This changes the transformers that are available in SDV models to be those that are in RDT version 1.2.0. As a result, some arguments for initializing models have changed.
Additionally, this release fixes bugs related to loading models with custom constraints. It also fixes a bug that added NaNs to the index of sampled data when using `sample_remaining_columns`.
Bugs Fixed
- Incorrect rounding in Custom Constraint example - Issue #941 by @amontanez24
- Can't save the model if use the custom constraint - Issue #928 by @pvk-developer
- User Guide code fixes - Issue #983 by @amontanez24
- Index contains NaNs when using sample_remaining_columns - Issue #985 by @amontanez24
- Cannot sample after loading a model with custom constraints: TypeError - Issue #984 by @pvk-developer
- Set HyperTransformer config manually, based on Metadata if given - Issue #982 by @pvk-developer
New Features
Maintenance
- Update the RDT version to 1.0 - Issue #897 by @pvk-developer
v0.16.0 - 2022-07-21
This release brings user-friendly improvements and bug fixes to the SDV constraints, to help users generate their synthetic data easily.

Some predefined constraints have been renamed and redefined to be more user friendly and consistent. The custom constraint API has also been updated for usability. The SDV now automatically determines the best `handling_strategy` to use for each constraint, attempting `transform` by default and falling back to `reject_sampling` otherwise. The `handling_strategy` parameters are no longer included in the API.

Finally, this version of SDV also unifies the parameters for all sampling-related methods for all models (including TabularPreset).
Changes to Constraints
- The `GreaterThan` constraint is now separated into two new constraints: `Inequality`, which is intended to be used between two columns, and `ScalarInequality`, which is intended to be used between a column and a scalar.
- The `Between` constraint is now separated into two new constraints: `Range`, which is intended to be used between three columns, and `ScalarRange`, which is intended to be used between a column and low and high scalar values.
- `FixedIncrements` - a new constraint that makes the data increment by a certain value.
- New `create_custom_constraint` function available to create custom constraints.
Removed Constraints
- `Rounding` - rounding is automatically handled by the `rdt.HyperTransformer`.
- `ColumnFormula` - the `create_custom_constraint` function takes its place and allows more advanced usage for end users.
New Features
- Improve error message for invalid constraints - Issue #801 by @fealho
- Numerical Instability in Constrained GaussianCopula - Issue #806 by @fealho
- Unify sampling params for reject sampling - Issue #809 by @amontanez24
- Split `GreaterThan` constraint into `Inequality` and `ScalarInequality` - Issue #814 by @fealho
- Split `Between` constraint into `Range` and `ScalarRange` - Issue #815 by @pvk-developer
- Change `columns` to `column_names` in `OneHotEncoding` and `Unique` constraints - Issue #816 by @amontanez24
- Update columns parameter in `Positive` and `Negative` constraint - Issue #817 by @fealho
- Create `FixedIncrements` constraint - Issue #818 by @amontanez24
- Improve datetime handling in `ScalarInequality` and `ScalarRange` constraints - Issue #819 by @pvk-developer
- Support strict boundaries even when transform strategy is used - Issue #820 by @fealho
- Add `create_custom_constraint` factory method - Issue #836 by @fealho
Internal Improvements
- Remove `handling_strategy` parameter - Issue #833 by @amontanez24
- Remove `fit_columns_model` parameter - Issue #834 by @pvk-developer
- Remove the `ColumnFormula` constraint - Issue #837 by @amontanez24
- Move `table_data.copy` to base class of constraints - Issue #845 by @fealho
Bugs Fixed
- Numerical Instability in Constrained GaussianCopula - Issue #801 by @tlranda and @fealho
- Fix error message for `FixedIncrements` - Issue #865 by @pvk-developer
- Fix constraints with conditional sampling - Issue #866 by @amontanez24
- Fix error message in `ScalarInequality` - Issue #868 by @pvk-developer
- Cannot use `max_tries_per_batch` on sample: `TypeError: sample() got an unexpected keyword argument 'max_tries_per_batch'` - Issue #885 by @amontanez24
- Conditional sampling + batch size: `ValueError: Length of values (1) does not match length of index (5)` - Issue #886 by @amontanez24
- `TabularPreset` doesn't support new sampling parameters - Issue #887 by @fealho
- Conditional Sampling: `batch_size` is being set to `None` by default? - Issue #889 by @amontanez24
- Conditional sampling using GaussianCopula inefficient when categories are noised - Issue #910 by @amontanez24
Documentation Changes
- Show the `API` for `TabularPreset` models - Issue #854 by @katxiao
- Update handling constraints doc - Pull Request #856 by @amontanez24
- Update custom constraints documentation - Pull Request #857 by @pvk-developer
v0.15.0 - 2022-05-25
This release improves the speed of the `GaussianCopula` model by removing logic that previously searched for the appropriate distribution to use. It also fixes a bug that was happening when conditional sampling was used with the `TabularPreset`.

The rest of the release focuses on making changes to improve constraints, including changing the `UniqueCombinations` constraint to `FixedCombinations`, making the `Unique` constraint work with missing values, and erroring when null values are seen in the `OneHotEncoding` constraint.
New Features
- Silence warnings coming from univariate fit in copulas - Issue #769 by @pvk-developer
- Remove parameters related to distribution search and change default - Issue #767 by @fealho
- Update the UniqueCombinations constraint - Issue #793 by @fealho
- Make Unique constraint work with nans - Issue #797 by @fealho
- Error out if nans in OneHotEncoding - Issue #800 by @amontanez24
Bugs Fixed
Documentation Changes
v0.14.1 - 2022-05-03
This release adds a `TabularPreset`, available in the `sdv.lite` module, which allows users to easily optimize a tabular model for speed.

In this release, we also include bug fixes for sampling with conditions, an unresolved warning, and setting field distributions. Finally, we include documentation updates for sampling and the new `TabularPreset`.
Bugs Fixed
- Sampling with conditions={column: 0.0} for float columns doesn't work - Issue #525 by @shlomihod and @tssbas
- resolved FutureWarning with Pandas replaced append by concat - Issue #759 by @Deathn0t
- Field distributions bug in CopulaGAN - Issue #747 by @katxiao
- Field distributions bug in GaussianCopula - Issue #746 by @katxiao
New Features
- Set default transformer to categorical_fuzzy - Issue #768 by @amontanez24
- Model nulls normally when tabular preset has constraints - Issue #764 by @katxiao
- Don't modify my metadata object - Issue #754 by @amontanez24
- Presets should be able to handle constraints - Issue #753 by @katxiao
- Change preset optimize_for --> name - Issue #749 by @katxiao
- Create a speed optimized Preset - Issue #716 by @katxiao
Documentation Changes
v0.14.0 - 2022-03-21
This release updates the sampling API and splits the existing functionality into three methods - `sample`, `sample_conditions`, and `sample_remaining_columns`. We also add support for sampling in batches, displaying a progress bar when sampling with more than one batch, sampling deterministically, and writing the sampled results to an output file. Finally, we include fixes for sampling with conditions and updates to the documentation.
Bugs Fixed
- Fix write to file in sampling - Issue #732 by @katxiao
- Conditional sampling doesn't work if the model has a CustomConstraint - Issue #696 by @katxiao
New Features
- Updates to GaussianCopula conditional sampling methods - Issue #729 by @katxiao
- Update conditional sampling errors - Issue #730 by @katxiao
- Enable Batch Sampling + Progress Bar - Issue #693 by @katxiao
- Create sample_remaining_columns() method - Issue #692 by @katxiao
- Create sample_conditions() method - Issue #691 by @katxiao
- Improve sample() method - Issue #690 by @katxiao
- Create Condition object - Issue #689 by @katxiao
- Is it possible to generate data with new set of primary keys? - Issue #686 by @katxiao
- No way to fix the random seed? - Issue #157 by @katxiao
- Can you set a random state for the sdv.tabular.ctgan.CTGAN.sample method? - Issue #515 by @katxiao
- generating different synthetic data while training the model multiple times. - Issue #299 by @katxiao