WSDM Cup 2017 Vandalism Detection Task: Classification and Evaluation
=====================================================================

The [WSDM Cup 2017](https://www.wsdm-cup-2017.org/) was a data mining challenge held in conjunction with the 10th International Conference on Web Search and Data Mining (WSDM). The goal of the [vandalism detection task](https://www.wsdm-cup-2017.org/vandalism-detection.html) was to compute a vandalism score for each Wikidata revision denoting the likelihood of this revision being vandalism or similarly damaging. This is the classification and evaluation component for the baselines WDVD, ORES, and FILTER. The feature extraction can be done with the corresponding [feature extraction component](https://github.com/heindorf/wsdmcup17-wdvd-feature-extraction).

Paper
-----

This source code forms the basis for the overview paper of the [vandalism detection task at WSDM Cup 2017](https://arxiv.org/abs/1712.05956). When using the code, please make sure to refer to it as follows:

```TeX
@inproceedings{heindorf2017overview,
  author    = {Stefan Heindorf and
               Martin Potthast and
               Gregor Engels and
               Benno Stein},
  title     = {Overview of the Wikidata Vandalism Detection Task at {WSDM} Cup 2017},
  booktitle = {{WSDM Cup 2017 Notebook Papers}},
  url       = {https://arxiv.org/abs/1712.05956},
  year      = {2017}
}
```

The code is based on the [Wikidata Vandalism Detector 2016](https://doi.acm.org/10.1145/2983323.2983740):

```TeX
@inproceedings{heindorf2016vandalism,
  author    = {Stefan Heindorf and
               Martin Potthast and
               Benno Stein and
               Gregor Engels},
  title     = {Vandalism Detection in Wikidata},
  booktitle = {{CIKM}},
  pages     = {327--336},
  publisher = {{ACM}},
  url       = {https://doi.acm.org/10.1145/2983323.2983740},
  year      = {2016}
}
```

Classification and Evaluation Component
---------------------------------------

### Requirements

The code was tested with Python 3.5.2, 64 Bit under Windows 10.

### Installation

We recommend [Miniconda](http://conda.pydata.org/miniconda.html) for easy installation on many platforms.

1. Create new environment: `conda create --name wsdmcup17 python=3.5.2 --file requirements.txt`
2. Activate environment: `activate wsdmcup17`
3. Copy the [AUCCalculator](http://mark.goadrich.com/programs/AUC/) to the folder `lib`

### Execute Classification

Usage:

    python wsdmcup17_classification.py FEATURES TRUTH RESULTS

Given a FEATURES file and TRUTH files (in bz2 format), the script splits the dataset, performs the classification, and stores all results with the RESULTS prefix. Multiple TRUTH files are passed as a single semicolon-separated argument, as in the example below.

Example (a single command, wrapped here for readability):

    python wsdmcup17_classification.py
        'features.csv.bz2'
        'wdvc-2016/training/wdvc16_truth.csv.bz2;wdvc-2016/validation/wdvc16_2016_03_truth.csv.bz2;wdvc-2016/testing/wdvc16_2016_05_truth.csv.bz2'
        'classification/20160101_0000000/20160101_0000000'
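
Because the TRUTH argument packs several paths into one semicolon-separated string, it can be convenient to assemble the command programmatically. The following is a minimal sketch, not part of the repository: the helper name `classification_command` is hypothetical, and it assumes the script is invoked from the repository root with the argument order shown in the usage line above.

```python
import sys


def classification_command(features, truth_files, results_prefix,
                           script="wsdmcup17_classification.py"):
    """Build the argument list for the classification script.

    Hypothetical helper: it only mirrors the usage line
    `python wsdmcup17_classification.py FEATURES TRUTH RESULTS`.
    """
    # The README example passes all truth files as ONE argument,
    # separated by semicolons.
    truth_arg = ";".join(truth_files)
    return [sys.executable, script, features, truth_arg, results_prefix]


cmd = classification_command(
    "features.csv.bz2",
    [
        "wdvc-2016/training/wdvc16_truth.csv.bz2",
        "wdvc-2016/validation/wdvc16_2016_03_truth.csv.bz2",
        "wdvc-2016/testing/wdvc16_2016_05_truth.csv.bz2",
    ],
    "classification/20160101_0000000/20160101_0000000",
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids having to quote the semicolons.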

### Configure Evaluation

Configure the paths to the score files in the config file `teams.json`. For example,

    {
      "Buffaloberry": "wsdmcup17_buffaloberry.csv.bz2",
      "Conkerberry": "wsdmcup17_conkerberry.csv.bz2",
      "Honeyberry": "wsdmcup17_honeyberry.csv.bz2",
      "Loganberry": "wsdmcup17_loganberry.csv.bz2",
      "Riberry": "wsdmcup17_riberry.csv.bz2",
      "WDVD": "wsdmcup17_wdvd.csv.bz2",
      "ORES": "wsdmcup17_ores.csv.bz2",
      "FILTER": "wsdmcup17_filter.csv.bz2"
    }
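
Before starting a long evaluation run, it may help to check that every score file listed in `teams.json` actually exists. The sketch below is not part of the repository. The function name `missing_score_files` and the `base_dir` parameter are hypothetical, and how the evaluation script itself resolves relative paths is not documented here, so resolving them against `base_dir` is an assumption.

```python
import json
import os


def missing_score_files(teams_path, base_dir="."):
    """Return {team: resolved_path} for score files that do not exist.

    Assumption: relative paths in teams.json are resolved against
    `base_dir`; the real script may resolve them differently.
    """
    with open(teams_path, encoding="utf-8") as f:
        teams = json.load(f)  # mapping of team name -> score file path

    missing = {}
    for team, path in teams.items():
        resolved = path if os.path.isabs(path) else os.path.join(base_dir, path)
        if not os.path.isfile(resolved):
            missing[team] = resolved
    return missing
```

Calling `missing_score_files("teams.json")` returns an empty dictionary when all configured score files are present.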

### Execute Evaluation

Usage:

    python wsdmcup17_evaluation.py FEATURES TEAMS TRUTH RESULTS

Given a FEATURES file, a TEAMS file with paths to scores, a TRUTH file, and a RESULTS prefix, the script evaluates the performance of the teams and computes the meta approach.

Example (a single command, wrapped here for readability):

    python wsdmcup17_evaluation.py
        'features.csv.bz2'
        'teams.json'
        'wdvc-2016/testing/wdvc16_2016_05_truth.csv.bz2'
        'evaluation/20160101_0000000/20160101_0000000'
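
This README does not spell out which metrics the evaluation reports. The AUCCalculator dependency in the installation steps suggests area-under-curve measures, so the following sketch shows how a ROC-AUC value can be computed from truth labels and vandalism scores. It is a generic rank-based (Mann-Whitney) formulation written for illustration, not the implementation used by `wsdmcup17_evaluation.py`, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC; tied scores receive their average rank.

    labels: iterable of 0/1 (1 = vandalism), scores: iterable of floats.
    Illustrative only; not the repository's evaluation code.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the group of equal scores occupying 1-based ranks i+1 .. j.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        pos_in_group = sum(1 for k in range(i, j) if pairs[k][1])
        rank_sum_pos += avg_rank * pos_in_group
        n_pos += pos_in_group
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    # Mann-Whitney U statistic of the positives, normalised to [0, 1].
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A value of 0.5 corresponds to random ranking and 1.0 to a ranking that places every vandalism revision above every regular one.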

### Configuration

The constants in the file `config.py` control which parts of the code are executed, the caching behavior, and the level of parallelism.

Naturally, there is a tradeoff between maximum parallelism and minimum memory consumption. When executing all parts of the code with 16 parallel processes, about 256 GB of RAM is required.
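
The figure above works out to roughly 16 GB per process (256 GB / 16). As a rough planning aid, the sketch below estimates how many processes fit into a given amount of RAM. It assumes that memory use scales linearly with the number of processes, which this README does not state, and the function name `max_processes` is hypothetical; the actual setting lives in `config.py`.

```python
def max_processes(available_gb, gb_per_process=256 / 16):
    """Estimate the number of parallel processes that fit into RAM.

    Assumption: memory grows linearly with the process count, using the
    256 GB / 16 processes figure from this README (about 16 GB each).
    """
    return max(1, int(available_gb // gb_per_process))
```

Under this assumption, a machine with 64 GB of RAM would be limited to about 4 parallel processes.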

### Linting

Run `flake8`.

### Data Download

- Feature file as computed with the [feature extraction component](https://github.com/heindorf/wsdmcup17-wdvd-feature-extraction):
  - [wsdmcup17_features.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_features.csv.bz2)
- Truth files from the [Wikidata Vandalism Corpus 2016](http://www.wsdm-cup-2017.org/vandalism-detection.html):
  - [wdvc16_truth.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wdvc16_truth.csv.bz2)
  - [wdvc16_2016_03_truth.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wdvc16_2016_03_truth.csv.bz2)
  - [wdvc16_2016_05_truth.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wdvc16_2016_05_truth.csv.bz2)
- Score files from the [WSDM Cup 2017 Proceedings](https://www.wsdm-cup-2017.org/proceedings.html):
  - [wsdmcup17_buffaloberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_buffaloberry.csv.bz2)
  - [wsdmcup17_conkerberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_conkerberry.csv.bz2)
  - [wsdmcup17_honeyberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_honeyberry.csv.bz2)
  - [wsdmcup17_loganberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_loganberry.csv.bz2)
  - [wsdmcup17_riberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_riberry.csv.bz2)
  - [wsdmcup17_wdvd.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_wdvd.csv.bz2)
  - [wsdmcup17_ores.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_ores.csv.bz2)
  - [wsdmcup17_filter.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_filter.csv.bz2)
  - [wsdmcup17_meta.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_meta.csv.bz2)
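
All files listed above share one base URL, so they can be fetched in a loop. The following is a minimal sketch using only the Python standard library. The names `BASE_URL`, `FILES`, `file_url`, and `download_all` are hypothetical, the sketch assumes the links above are still reachable, and it does not verify file integrity.

```python
import os
import urllib.request

# Base URL and file names are taken from the download list in this README.
BASE_URL = "https://groups.uni-paderborn.de/wdqa/wsdmcup17/"
FILES = [
    "wsdmcup17_features.csv.bz2",
    "wdvc16_truth.csv.bz2",
    "wdvc16_2016_03_truth.csv.bz2",
    "wdvc16_2016_05_truth.csv.bz2",
    "wsdmcup17_buffaloberry.csv.bz2",
    "wsdmcup17_conkerberry.csv.bz2",
    "wsdmcup17_honeyberry.csv.bz2",
    "wsdmcup17_loganberry.csv.bz2",
    "wsdmcup17_riberry.csv.bz2",
    "wsdmcup17_wdvd.csv.bz2",
    "wsdmcup17_ores.csv.bz2",
    "wsdmcup17_filter.csv.bz2",
    "wsdmcup17_meta.csv.bz2",
]


def file_url(name):
    """Return the download URL for one of the listed files."""
    return BASE_URL + name


def download_all(target_dir, names=FILES):
    """Download each file into target_dir, skipping files already present."""
    os.makedirs(target_dir, exist_ok=True)
    for name in names:
        dest = os.path.join(target_dir, name)
        if os.path.exists(dest):
            continue  # do not re-download existing files
        urllib.request.urlretrieve(file_url(name), dest)
```

For example, `download_all("data")` would place all files in a local `data` folder. Note that the feature file is published as `wsdmcup17_features.csv.bz2`, whereas the usage examples refer to `features.csv.bz2`, so the FEATURES argument has to be adjusted accordingly.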

Contact
-------

For questions and feedback please contact:

- Stefan Heindorf, Paderborn University
- Martin Potthast, Leipzig University
- Gregor Engels, Paderborn University
- Benno Stein, Bauhaus-Universität Weimar

License
-------

The code by Stefan Heindorf, Martin Potthast, Gregor Engels, and Benno Stein is licensed under an MIT license.