forked from HoloClean/holoclean
Merge hcq-embedding-3 (HoloClean#97)
* Quantization and handling of numerical/mixed data.
* Relocated test data into subdirectories.
* Moved active attributes to right after error detection and inside Dataset; moved correlations to a separate module.
* Refactored domain generation to sort the domain by co-occurrence probability, and added domain generation for the tuple embedding model.
* Made the co-occurrence featurizer generate co-occurrence features only for active attributes. Refactored domain to run the estimator separately from domain generation.
* Implemented the TupleEmbedding model as an estimator.
* Always load clean/ground-truth data as strings, since raw data is loaded/stored as strings.
* Added a featurizer for learned embeddings from the TupleEmbedding model.
* Support multiple layers during repair and made TupleEmbedding dump/load more sophisticated.
* Improved validation logging and fixed a few bugs.
* Improved validation in TupleEmbedding using pandas DataFrames.
* Support multi-dimensional quantization.
* Quantize from a dict rather than from numerical attributes.
* Mean/variance-normalize numerical attributes in context, and added a non-linearity to numerical spans.
* Support specifying n-dimensional numerical attribute groups vs. splitting on columns.
* Fixed None numerical_attr_groups.
* Fixed reported RMS error and conversion to floats for quantization.
* Added store_to_fb flag to load_data, added an LR schedule to TupleEmbedding, added multiple ground truths in evaluation, and changed EmbeddingFeat to return probabilities instead of embedding vectors.
* Pre-split domain and ground-truth values.
* Fixed batch size argument in EmbeddingFeaturizer.
* Removed numerical_attrs reference from Table.
* Fixed how multiple ground truths are handled. Use simplified numerical-regression TupleEmbedding with non-linearity.
* Max domain size need only be as large as the largest domain among categorical attributes.
* Removed domain for numerical attributes in TupleEmbedding.
* Fixed some reference issues and added infer-all mode.
* Fixed _nan_ replacement, max_cat_domain possibly being NaN, and evaluation for sample accuracy.
* Do not weak-label clean cells; fixed raw data in the Logistic estimator.
* Added ReLU after context for numerical targets in TupleEmbedding, and refactored EmbeddingFeat to support a numerical feature (RMSE) from TupleEmbedding.
* Use a cosine-annealing-with-restarts LR schedule, and use weak_label instead of init.
* Fixed memory issues with get_features and predict_pp_batch.
* Fixed bug in get_features.
* Added comment to EmbeddingFeat.
* Finally fixed memory issues with torch.no_grad.
* ConstraintFeaturizer runs on un-quantized values.
* Do not drop single-value cells (for evaluation).
* Do not generate queries/features for a DC that does not pertain to the attributes we are training on.
* Fixed ConstraintFeaturizer to handle no DCs.
* Removed deprecated code and added dropout.
* Fixed calculation of num_batches in the learning loop.
* Do not drop null-init cells with len(dom) <= 1.
* Fixed z-scoring with 0 std and deletion of e-notation numerical values.
* Do not quantize if bins > unique values.
* Fixed some things in domain generation.
* Added repair with a validation set, and removed multiple correct values in evaluation.
* Fixed domain generation to include single-value cells in the domain.
* Handle untrained context values properly, and added code for domain co-occurrence in TupleEmbedding.
* Regression fix for moving raw_data_dict before z-normalization, and removed code references to domain_cooccur (for the most part).
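Two of the numerical-data fixes above (z-scoring with a 0 std, and skipping quantization when bins exceed the number of unique values) can be sketched as below. This is an illustrative NumPy sketch, not the actual HoloClean implementation; the function names `z_normalize` and `quantize` are hypothetical.

```python
import numpy as np

def z_normalize(values):
    """Z-score a numerical column, guarding against zero std.

    A constant column has std == 0; dividing by it would yield
    NaN/inf, so map all values to 0 instead (the bug fixed above).
    """
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0.0:
        return np.zeros_like(values)
    return (values - values.mean()) / std

def quantize(values, bins):
    """Quantile-bin values, skipping when bins > unique values.

    With fewer distinct values than bins, binning is a no-op (or
    degenerate), so return the values unchanged.
    """
    values = np.asarray(values, dtype=float)
    if bins > np.unique(values).size:
        return values
    # Interior quantile edges split values into `bins` groups.
    edges = np.quantile(values, np.linspace(0.0, 1.0, bins + 1))
    idx = np.digitize(values, edges[1:-1])
    # Represent each value by the mean of its bin (one simple choice).
    return np.array([values[idx == b].mean() for b in idx])
```

For example, `quantize([1, 2], bins=5)` leaves the data untouched, while `z_normalize([5, 5, 5])` returns zeros rather than NaNs.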
Showing 10 changed files with 2,522 additions and 62 deletions.