Code for the paper "Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise"

This folder contains the code to reproduce the experimental results for the submission "Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise".
The following instructions show how the results can be obtained using Ubuntu 20.04.1, an Intel processor, CUDA Toolkit 10.1 and NVIDIA TITAN Xp GPUs.
For all datasets, embeddings are applied to the raw images and texts.
Prepare the environment using Conda by running

```bash
conda env create -f environment.yml
```

using the supplied `environment.yml` file. The command will create an environment with the name `snoopy`. For more information on creating Conda environments, please refer to the documentation. Activate it with the command

```bash
conda activate snoopy
```
The `yelp` folder should contain the `yelp_train.csv` and `yelp_test.csv` files obtained after preparing the YELP dataset. Due to space constraints, we did not upload them together with the code, but we can provide them upon request.
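A minimal sanity check before running the YELP embedding script below, assuming the `yelp` folder sits in the repository root (this location is an assumption, not documented above):

```python
from pathlib import Path

# Hypothetical location of the YELP CSVs relative to the repository root.
yelp_dir = Path("yelp")

for name in ("yelp_train.csv", "yelp_test.csv"):
    path = yelp_dir / name
    if not path.is_file():
        raise FileNotFoundError(f"Missing {path}; prepare the YELP dataset first.")
print("YELP CSV files found, ready to run embed-yelp.sh")
```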
Adjust the `batch_size` values in the `embed.py` file if needed and run the following scripts:

```bash
bash embed-cifar10.sh
bash embed-cifar100.sh
bash embed-mnist.sh
bash embed-imdb_reviews.sh
bash embed-sst2.sh
bash embed-yelp.sh
```
The embeddings and original datasets will be saved in the `cache` folder, and the embedded datasets in the `results` folder as `<dataset name>-<embedding name>.npz`, where `<dataset name>` corresponds to one of `cifar10`, `cifar100`, `mnist`, `imdb_reviews`, `sst2`, `yelp`, and `<embedding name>` corresponds to the shorter name given to each of the embeddings (the keys of the `embeddings` dictionary in `embed.py`). The first available GPU will be used for running inference.
Note: the `cache` and `results` folders may take up a lot of space after running the scripts.
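A minimal sketch for inspecting the embedded datasets, assuming only the file naming convention above; the array keys inside each archive are defined by `embed.py` and are not documented here:

```python
import glob
import numpy as np

# List the arrays stored in every embedded dataset produced by the scripts above.
for path in sorted(glob.glob("results/*.npz")):
    with np.load(path) as data:
        print(path, {key: data[key].shape for key in data.files})
```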
Prepare the `snoopy-cpu-analysis` environment by running

```bash
conda env create -f environment-cpu-analysis.yml
```

using the supplied `environment-cpu-analysis.yml` file, and activate it with

```bash
conda activate snoopy-cpu-analysis
```
Run

```bash
python convergence.py <dataset name>
```

where `<dataset name>` is one of the datasets listed above. This will produce convergence curve data stored in files `<dataset name>-<embedding name>-errs-cosine-0.0.npy` in the `results` folder.
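A minimal sketch for inspecting the produced convergence curve files (the `cifar10` prefix is just an example dataset; the exact shape and content of each array depend on `convergence.py`):

```python
import glob
import numpy as np

# Load every convergence curve produced for one example dataset and report its shape.
for path in sorted(glob.glob("results/cifar10-*-errs-cosine-0.0.npy")):
    errs = np.load(path)
    print(path, errs.shape)
```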
Prepare the `snoopy-errors` environment by running

```bash
conda env create -f environment-errors.yml
```

using the supplied `environment-errors.yml` file, and activate it with

```bash
conda activate snoopy-errors
```
Run

```bash
python errors.py
```

This will produce error data stored in files `<dataset name>-<embed name>-test.txt` in the `results` folder.
Each file contains a JSON dictionary where the key at the first level denotes the amount of label noise (one of `"0.0"`, `"0.1"`, `"0.2"`, ..., `"1.0"`) and the key at the second level denotes the method (one of `"GHP Upper"`, `"GHP Lower"`, `"1-NN"`, `"1-NN cosine"`, `"1-NN LOO"`, `"1-NN cosine LOO"`, where `1-NN` and `1-NN LOO` use the Euclidean distance). For the results related to GHP we only use the `"GHP Lower"` key. The values are error values; all values except the GHP ones are not Bayes error estimates.
Note: for the `yelp` dataset, a lot of RAM is needed to compute the results related to the GHP method.
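As an illustration of the file format described above, a minimal sketch for reading the produced files (files are discovered via the naming convention above):

```python
import glob
import json

# First-level keys: label-noise amounts ("0.0" ... "1.0");
# second-level keys: method names ("GHP Lower", "1-NN cosine", ...).
for path in sorted(glob.glob("results/*-test.txt")):
    with open(path) as f:
        errors = json.load(f)
    for noise_level in sorted(errors):
        methods = errors[noise_level]
        # Only the "GHP Lower" key is used for the GHP-related results.
        print(path, noise_level, methods["GHP Lower"], methods["1-NN cosine"])
```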
The file `errors_data.txt` contains a JSON dictionary with results for all datasets and embeddings. The key at the first level denotes the dataset, at the second level the embedding, and at the third level the split used (always `test`). Further levels correspond to the individual `.txt` files produced by running `errors.py`. Please note that the file contains only a subset of the data that could be generated by `errors.py`, which is sufficient for the results presented in the paper.
Activate the already installed `snoopy-cpu-analysis` environment:

```bash
conda activate snoopy-cpu-analysis
```
Run

```bash
python lr.py results lr.txt yes
```

This will run LR on top of each embedding, for all datasets, different amounts of label noise, and different hyper-parameters. Each experiment is repeated 5 times. The resulting file `lr.txt` will contain a JSON dictionary where the key at the first level denotes the dataset, the key at the second level the embedding, the key at the third level the amount of label noise, the key at the fifth level the L2 regularization and SGD learning rate parameters, and the keys at the last level the achieved error rates and runtimes.
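To inspect the resulting nested dictionary, a minimal sketch is shown below; it only assumes the nesting order described above and treats the exact leaf key names (error rates, runtimes) as undocumented:

```python
import json

# Print every leaf record in lr.txt together with the path of keys leading to it
# (dataset -> embedding -> noise -> ... -> error rate / runtime).
with open("lr.txt") as f:
    lr_results = json.load(f)

def walk(node, path=()):
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, path + (key,))
    else:
        print(" / ".join(map(str, path)), "->", node)

walk(lr_results)
```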
Activate the already installed `snoopy` environment:

```bash
conda activate snoopy
```

Run

```bash
python finetune/export_text_file_to_tfrecords.py
```

and then

```bash
bash finetune/run_all.sh
```
For the default hyper-parameters (best for both without noise), run the scripts without any parameters. To change the hyper-parameters when running the grid search for the image task, set the argument `--lr` to a value in `[0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1]` and the argument `--reg` to a value in `[0.000001, 0.000003, 0.00001, 0.00003, 0.0001, 0.0003, 0.001]`; see the sketch below for sweeping the full grid.
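A possible driver for sweeping the whole grid rather than a single setting is sketched below; it assumes that `finetune/run_all.sh` accepts and forwards the `--lr` and `--reg` flags, which is an assumption to verify against the script rather than documented behaviour:

```python
import itertools
import subprocess

# Hypothetical grid-search driver over the hyper-parameter ranges listed above.
lrs = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1]
regs = [0.000001, 0.000003, 0.00001, 0.00003, 0.0001, 0.0003, 0.001]

for lr, reg in itertools.product(lrs, regs):
    # Assumes run_all.sh passes --lr and --reg through to the training script.
    subprocess.run(
        ["bash", "finetune/run_all.sh", "--lr", str(lr), "--reg", str(reg)],
        check=True,
    )
```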
Run

```bash
python finetune/collect_results.py
```
Activate the already installed `snoopy` environment:

```bash
conda activate snoopy
```

Run

```bash
bash autokeras/run_all.sh
```

and then

```bash
python autokeras/collect_results.py
```
Activate the already installed `snoopy-cpu-analysis` environment:

```bash
conda activate snoopy-cpu-analysis
```

Run

```bash
python plots.py <convergence curve path>
```

where `<convergence curve path>` should be a folder with the subfolders `cifar10`, `cifar100`, `mnist`, `imdb_reviews`, `sst2`, `yelp`, and each of the subfolders should contain the convergence curve files produced by `convergence.py` above. In any case, the program will use the results provided in the `errors_data.txt` file.
We provide a notebook, `end2end_ber_evaluation.ipynb`, with the code required to reproduce the experimental results both for the evaluation of the BER estimates against the LR, Fine-Tune, and AutoKeras baselines, and for the end-to-end use-case simulation.
The results for all 19 VTAB-1K datasets and 235 public embeddings from Hugging Face are available in the `vtab-results` directory.