In this competition, we will be asking you to score a set of about fourteen thousand comments. Pairs of comments were presented to expert raters, who marked one of the two comments as more harmful, each according to their own notion of toxicity. In this contest, when you provide scores for comments, they will be compared with several hundred thousand rankings. Your average agreement with the raters will determine your individual score. In this way, we hope to focus on ranking the severity of comment toxicity from innocuous to outrageous, where the middle matters as much as the extremes.
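Concretely, the metric can be sketched as follows: for every annotated pair, a submission scores a hit when the comment the rater marked as more toxic received the higher predicted score. The identifiers and data layout below are illustrative, not the competition's exact format.

```python
def average_agreement(scores, pairs):
    """scores: comment_id -> predicted severity.
    pairs: (less_toxic_id, more_toxic_id) tuples from the raters."""
    hits = sum(scores[more] > scores[less] for less, more in pairs)
    return hits / len(pairs)

# toy example: three comments, two rater judgments
scores = {"c1": 0.9, "c2": 0.2, "c3": 0.5}
pairs = [("c2", "c1"), ("c3", "c2")]  # raters say c1 > c2 and c2 > c3
print(average_agreement(scores, pairs))  # 0.5: we agree on the first pair only
```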
- Ruddit Dataset
- Jigsaw Rate Severity of Toxic Comments
- toxic-task
- Team Merge Deadline - January 31, 2022
- Submission Deadline - February 7, 2022
- Jeongwon Kim ([email protected])
- Jaewoo Park ([email protected])
- Youngmin Paik ([email protected])
- Kyubin Kim ([email protected])
- BERT Model (Electra + RoBERTa + DeBERTa + Distilbart) Ensemble
- BERT Model (Electra + RoBERTa + DeBERTa + Distilbart) + Linear Model (TF-IDF + LinearRegression) Ensemble
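The two submissions above blend the outputs of several models. Because different models emit scores on different scales, a common approach is to rank-normalize each model's predictions before averaging them. This is a hedged sketch of that idea, not the team's exact blending weights.

```python
def to_ranks(scores):
    """Map raw scores to normalized ranks in [0, 1] (ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, i in enumerate(order):
        ranks[i] = rank / (len(scores) - 1)
    return ranks

def ensemble(model_scores, weights=None):
    """Weighted average of rank-normalized scores from several models."""
    n_models = len(model_scores)
    weights = weights or [1.0 / n_models] * n_models
    ranked = [to_ranks(s) for s in model_scores]
    n_items = len(model_scores[0])
    return [sum(w * r[i] for w, r in zip(weights, ranked)) for i in range(n_items)]

# hypothetical raw scores from two models on three comments
electra_scores = [0.1, 0.8, 0.3]
roberta_scores = [2.0, 9.5, 4.1]
print(ensemble([electra_scores, roberta_scores]))  # [0.0, 1.0, 0.5]
```

Rank averaging only matters for this competition's metric insofar as it preserves ordering, which is exactly what the pairwise-agreement evaluation rewards.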
This repository is a pipeline for experimenting with PLMs (Pretrained Language Models), which are widely used in NLP tasks. The following models were tested:
RoBERTa base
RoBERTa large
ELECTRA
Muppet RoBERTa
DeBERTa
BART
Distilbart
GPT2
We judged that BERT-based models (e.g. BERT, RoBERTa) would be more robust than linear models (e.g. TF-IDF + LinearRegression) because they can respond to unseen words and take context into account.
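For contrast, the linear baseline mentioned above can be sketched in a few lines of scikit-learn. The texts and severity targets here are toy values, not the competition data; the point is that a bag-of-words model assigns nothing to vocabulary it never saw during training.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# toy training data with hypothetical severity labels
texts = ["you are wonderful", "you are awful", "have a nice day", "you are terrible"]
targets = [0.0, 0.8, 0.0, 0.9]

# TF-IDF features fed into a plain linear regressor
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearRegression())
model.fit(texts, targets)

# words absent from the training vocabulary contribute nothing to the score,
# which is the robustness gap relative to BERT-based models noted above
print(model.predict(["you are awful"]))
```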
- We tuned models against the Public Leaderboard score alone, without tracking a cross-validation score, which led to disappointing results on the Private Leaderboard.
- We trained the model on a combination of multiple datasets (e.g. toxic-task, the Ruddit Dataset), so the model generalized poorly to the Private dataset.
- Fetch Pretrained Models
$ sh ./download_pretrained_models.sh
- Train
$ cd /jigsaw-toxic-severity-rating/jigsaw_toxic_severity_rating
$ python3 run_train.py \
    --config-file /jigsaw-toxic-severity-rating/configs/roberta.yaml \
    --train \
    --training-keyword roberta-test
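The `--config-file` flag points at a YAML file such as `configs/roberta.yaml`. A minimal illustrative shape is shown below; the actual keys are defined by `run_train.py`'s config loader and may differ.

```yaml
# illustrative only — real keys depend on run_train.py's config loader
model_name: roberta-base
max_length: 256
epochs: 3
batch_size: 32
learning_rate: 2.0e-5
seed: 42
```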
You can set up a `launch.json` for VS Code as follows:
{
"version": "0.2.0",
"configurations": [
{
"name": "Python: Current File",
"type": "python",
"request": "launch",
"program": "run_train.py",
"console": "integratedTerminal",
"cwd": "${workspaceFolder}/jigsaw_toxic_severity_rating",
"args": ["--config-file", "../configs/roberta.yaml", "--train"]
}
]
}
- Python 3.7 (To match with Kaggle environment)
- Conda
- git
- git-lfs
- CUDA 11.3 + PyTorch 1.10.1
The PyTorch version may vary depending on your hardware configuration.
git clone https://github.com/medal-contender/nbme-score-clinical-patient-notes.git
conda create -n nbme python=3.7
conda activate nbme
# PyTorch installation process may vary depending on your hardware
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
cd nbme-score-clinical-patient-notes
pip install -r requirements.txt
Run this in WSL (or WSL2)
./download_pretrained_models.sh
$ git pull
If you have local changes that cause `git pull` to abort, one way to get around this is the following:
# removing the local changes
$ git stash
# update
$ git pull
# put the local changes back on top of the recent update
$ git stash pop