Commit 919dcfc

initial commit
0 parents  commit 919dcfc

39 files changed

+6375
-0
lines changed

.gitignore

+35
@@ -0,0 +1,35 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
lib/auc.jar

# IPython Notebook
.ipynb_checkpoints

# Profiler
/profile.pr

# DotEnv configuration
.env

# Project settings
.idea
.project
.pydevproject
.ropeproject
.spyderproject
.spyproject

# Data
*.bz2
/models/
/results/

# Temporary files
/queue.p

# Markdown
*.md.html

LICENSE

+21
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017 Stefan Heindorf, Martin Potthast, Gregor Engels, Benno Stein

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

+144
@@ -0,0 +1,144 @@
WSDM Cup 2017 Vandalism Detection Task: Classification and Evaluation
=====================================================================

The [WSDM Cup 2017](https://www.wsdm-cup-2017.org/) was a data mining challenge held in conjunction with the 10th International Conference on Web Search and Data Mining (WSDM). The goal of the [vandalism detection task](https://www.wsdm-cup-2017.org/vandalism-detection.html) was to compute a vandalism score for each Wikidata revision, denoting the likelihood that the revision is vandalism or similarly damaging. This is the classification and evaluation component for the baselines WDVD, ORES, and FILTER. The features can be extracted with the corresponding [feature extraction component](https://github.com/heindorf/wsdmcup17-wdvd-feature-extraction).

Paper
-----

This source code forms the basis for the overview paper of the [vandalism detection task at WSDM Cup 2017](https://arxiv.org/abs/1712.05956). When using the code, please cite it as follows:

```TeX
@inproceedings{heindorf2017overview,
  author    = {Stefan Heindorf and
               Martin Potthast and
               Gregor Engels and
               Benno Stein},
  title     = {Overview of the Wikidata Vandalism Detection Task at {WSDM} Cup 2017},
  booktitle = {{WSDM Cup 2017 Notebook Papers}},
  url       = {https://arxiv.org/abs/1712.05956},
  year      = {2017}
}
```

The code is based on the [Wikidata Vandalism Detector 2016](https://doi.acm.org/10.1145/2983323.2983740):

```TeX
@inproceedings{heindorf2016vandalism,
  author    = {Stefan Heindorf and
               Martin Potthast and
               Benno Stein and
               Gregor Engels},
  title     = {Vandalism Detection in Wikidata},
  booktitle = {{CIKM}},
  pages     = {327--336},
  publisher = {{ACM}},
  url       = {https://doi.acm.org/10.1145/2983323.2983740},
  year      = {2016}
}
```

Classification and Evaluation Component
---------------------------------------

### Requirements

The code was tested with Python 3.5.2 (64-bit) under Windows 10.

### Installation

We recommend [Miniconda](http://conda.pydata.org/miniconda.html) for easy installation on many platforms.

1. Create a new environment: `conda create --name wsdmcup17 python=3.5.2 --file requirements.txt`
2. Activate the environment: `activate wsdmcup17`
3. Copy the [AUCCalculator](http://mark.goadrich.com/programs/AUC/) to the folder `lib`

### Execute Classification

Usage:

    python wsdmcup17_classification.py FEATURES TRUTH RESULTS

Given a FEATURES file and TRUTH files (in bz2 format), the script splits the dataset, performs the classification, and stores all results under the RESULTS prefix. Multiple TRUTH files are passed as a single argument, separated by semicolons, as in the example.

Example:

    python wsdmcup17_classification.py
        'features.csv.bz2'
        'wdvc-2016/training/wdvc16_truth.csv.bz2;wdvc-2016/validation/wdvc16_2016_03_truth.csv.bz2;wdvc-2016/testing/wdvc16_2016_05_truth.csv.bz2'
        'classification/20160101_0000000/20160101_0000000'

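The example above passes three truth files in one semicolon-separated argument. How `wsdmcup17_classification.py` parses that argument is not shown here; the following is only a minimal sketch, assuming a plain split on `;`. The helper name `split_truth_argument` is hypothetical and not part of the repository.

```python
def split_truth_argument(truth_arg):
    """Split a semicolon-separated TRUTH argument into a list of paths.

    Hypothetical helper: the actual parsing in the repository may differ.
    """
    paths = [part.strip() for part in truth_arg.split(';')]
    # Drop empty entries caused by stray or trailing semicolons.
    return [path for path in paths if path]


truth_files = split_truth_argument(
    'wdvc-2016/training/wdvc16_truth.csv.bz2;'
    'wdvc-2016/validation/wdvc16_2016_03_truth.csv.bz2;'
    'wdvc-2016/testing/wdvc16_2016_05_truth.csv.bz2'
)
for path in truth_files:
    print(path)
```

Quoting the whole argument on the command line matters in shells where an unquoted `;` ends the command.
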
### Configure Evaluation

Configure the paths to the score files in the config file `teams.json`. For example,

    {
        "Buffaloberry": "wsdmcup17_buffaloberry.csv.bz2",
        "Conkerberry": "wsdmcup17_conkerberry.csv.bz2",
        "Honeyberry": "wsdmcup17_honeyberry.csv.bz2",
        "Loganberry": "wsdmcup17_loganberry.csv.bz2",
        "Riberry": "wsdmcup17_riberry.csv.bz2",
        "WDVD": "wsdmcup17_wdvd.csv.bz2",
        "ORES": "wsdmcup17_ores.csv.bz2",
        "FILTER": "wsdmcup17_filter.csv.bz2"
    }

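Before starting a long evaluation run, it can help to check that every score file listed in `teams.json` actually exists. The sketch below is not part of the repository: the function names `load_teams` and `missing_score_files` are hypothetical, and resolving the paths relative to a `base_dir` is an assumption.

```python
import json
import os


def load_teams(teams_path):
    """Return a dict mapping team name to score file path."""
    with open(teams_path, encoding='utf-8') as f:
        teams = json.load(f)
    if not isinstance(teams, dict):
        raise ValueError('expected a JSON object of team name -> path')
    return teams


def missing_score_files(teams, base_dir='.'):
    """Return the sorted names of teams whose score file does not exist.

    Assumption: paths in teams.json are relative to base_dir.
    """
    return sorted(
        name for name, path in teams.items()
        if not os.path.isfile(os.path.join(base_dir, path))
    )
```

Calling `missing_score_files(load_teams('teams.json'))` then returns an empty list when all score files are in place.
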
### Execute Evaluation

Usage:

    python wsdmcup17_evaluation.py FEATURES TEAMS TRUTH RESULTS

Given a FEATURES file, a TEAMS file with paths to the score files, a TRUTH file, and a RESULTS prefix, the script evaluates the performance of the teams and computes the meta approach.

Example:

    python wsdmcup17_evaluation.py
        'features.csv.bz2'
        'teams.json'
        'wdvc-2016/testing/wdvc16_2016_05_truth.csv.bz2'
        'evaluation/20160101_0000000/20160101_0000000'

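The AUCCalculator dependency suggests that teams are compared with area-under-curve metrics. To illustrate what such a metric measures, here is a dependency-free sketch of ROC AUC using the rank-sum (Mann-Whitney U) formulation. It is not the repository's implementation, and the function name `roc_auc` is chosen for illustration only.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of booleans (True = vandalism).
    scores: iterable of floats (higher = more likely vandalism).
    Tied scores receive their average rank.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n = len(pairs)
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError('need at least one positive and one negative')

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # The 1-based ranks i+1 .. j all share the same average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


print(roc_auc([False, False, True, True], [0.1, 0.4, 0.35, 0.8]))
```

The value is the probability that a randomly chosen vandalism revision receives a higher score than a randomly chosen regular revision, with ties counted as one half.
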
### Configuration

The constants in `config.py` control which parts of the code are executed, the caching behavior, and the level of parallelism.

Naturally, there is a trade-off between maximum parallelism and minimum memory consumption. When executing all parts of the code with 16 parallel processes, about 256 GB of RAM is required.

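As a rough worked example of this trade-off: 256 GB for 16 processes is about 16 GB per process, if memory consumption grows linearly with the number of processes. That linearity is an assumption, not a measured property of the code. Under it, a hypothetical helper for picking a job count could look like this:

```python
def max_jobs_for_memory(available_gb, gb_per_job=256 / 16, max_jobs=16):
    """Estimate how many parallel processes fit into available_gb of RAM.

    gb_per_job defaults to 256 GB / 16 processes = 16 GB, which assumes
    that memory grows linearly with the number of processes.
    """
    if available_gb <= 0:
        raise ValueError('available_gb must be positive')
    jobs = int(available_gb // gb_per_job)
    # Always allow at least one process and never exceed max_jobs.
    return max(1, min(max_jobs, jobs))


print(max_jobs_for_memory(64))  # 4 under the linear assumption
```

The result is only a starting point for the `*_N_JOBS` constants; actual memory use depends on which parts of the code are enabled.
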
### Linting

Run `flake8`.

### Data Download

- Feature file as computed with the [feature extraction component](https://github.com/heindorf/wsdmcup17-wdvd-feature-extraction):
  - [wsdmcup17_features.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_features.csv.bz2)
- Truth files from the [Wikidata Vandalism Corpus 2016](http://www.wsdm-cup-2017.org/vandalism-detection.html):
  - [wdvc16_truth.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wdvc16_truth.csv.bz2)
  - [wdvc16_2016_03_truth.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wdvc16_2016_03_truth.csv.bz2)
  - [wdvc16_2016_05_truth.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wdvc16_2016_05_truth.csv.bz2)
- Score files from the [WSDM Cup 2017 Proceedings](https://www.wsdm-cup-2017.org/proceedings.html):
  - [wsdmcup17_buffaloberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_buffaloberry.csv.bz2)
  - [wsdmcup17_conkerberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_conkerberry.csv.bz2)
  - [wsdmcup17_honeyberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_honeyberry.csv.bz2)
  - [wsdmcup17_loganberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_loganberry.csv.bz2)
  - [wsdmcup17_riberry.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_riberry.csv.bz2)
  - [wsdmcup17_wdvd.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_wdvd.csv.bz2)
  - [wsdmcup17_ores.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_ores.csv.bz2)
  - [wsdmcup17_filter.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_filter.csv.bz2)
  - [wsdmcup17_meta.csv.bz2](https://groups.uni-paderborn.de/wdqa/wsdmcup17/wsdmcup17_meta.csv.bz2)

Contact
-------

For questions and feedback, please contact:

Stefan Heindorf, Paderborn University
Martin Potthast, Leipzig University
Gregor Engels, Paderborn University
Benno Stein, Bauhaus-Universität Weimar

License
-------

The code by Stefan Heindorf, Martin Potthast, Gregor Engels, and Benno Stein is licensed under the MIT License.

config.py

+56
@@ -0,0 +1,56 @@
# -----------------------------------------------------------------------------
# WSDM Cup 2017 Classification and Evaluation
#
# Copyright (c) 2017 Stefan Heindorf, Martin Potthast, Gregor Engels, Benno Stein
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# -----------------------------------------------------------------------------

__version__ = "0.0.10"

OUTPUT_PREFIX = None  # is set during initialization

USE_VALIDATION_SET = False
USE_TEST_SET = True
STATISTICS_ENABLED = True
FEATURE_RANKING_ENABLED = False
OPTIMIZATION_ENABLED = False
CLASSIFICATION_ENABLED = True
CLASSIFICATION_GROUPS_ENABLED = False
BASELINES_ENABLED = True
ONLINE_LEARNING_ENABLED = False

LOADING_USE_MEMORY_CACHE = False
LOADING_USE_DISK_CACHE = False
PREPROCESSING_N_JOBS = 8
FEATURE_RANKING_N_JOBS = 1
CLASSIFICATION_N_JOBS = 4
CLASSIFICATION_N_JOBS_SIMPLE_MI = 2
OPTIMIZATION_N_JOBS = 4

BACKPRESSURE_WINDOW = 1

EVALUATION_MAX_POINTS_ON_CURVE = 10000

LOG_LEVEL = 'INFO'
TEMP_PREFIX = 'wsdmcup17-'


def get_globals():
    return globals()
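
`get_globals()` exposes the module namespace of `config.py` as a dictionary. How callers use it is not visible in this part of the commit; one plausible use, sketched here purely as an assumption, is to collect the upper-case constants so the effective settings can be logged with each run. The helper `settings_from_globals` and the stand-in dictionary are hypothetical.

```python
def settings_from_globals(module_globals):
    """Return only the upper-case constants from a module namespace."""
    return {
        name: value
        for name, value in module_globals.items()
        if name.isupper() and not name.startswith('_')
    }


# Stand-in for config.get_globals(); the values are copied from config.py.
example_globals = {
    '__version__': '0.0.10',
    'USE_TEST_SET': True,
    'CLASSIFICATION_N_JOBS': 4,
    'LOG_LEVEL': 'INFO',
    'get_globals': None,
}

settings = settings_from_globals(example_globals)
for name, value in sorted(settings.items()):
    print('{} = {!r}'.format(name, value))
```

Filtering on upper-case names keeps functions, imports, and dunder attributes such as `__version__` out of the logged settings.
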

lib/.gitkeep

Whitespace-only changes.

requirements.txt

+5
@@ -0,0 +1,5 @@
numpy==1.11.2
scipy==0.18.1
pandas==0.19.1
scikit-learn==0.18.0
psutil==4.4.2
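
All five dependencies are pinned to exact versions with `==`. A small sketch of reading such pins into a dictionary follows, for example to compare them against an environment. The helper `parse_pinned_requirements` is hypothetical, and it assumes every non-comment line has the form `name==version`.

```python
def parse_pinned_requirements(text):
    """Parse 'name==version' lines into a dict; skip blanks and comments."""
    pins = {}
    for line in text.splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition('==')
        if not sep or not version:
            raise ValueError('not an exact pin: {!r}'.format(line))
        pins[name.strip().lower()] = version.strip()
    return pins


REQUIREMENTS = """\
numpy==1.11.2
scipy==0.18.1
pandas==0.19.1
scikit-learn==0.18.0
psutil==4.4.2
"""

pins = parse_pinned_requirements(REQUIREMENTS)
print(pins['scikit-learn'])
```

On Python 3.8 or newer, each pin could then be compared with `importlib.metadata.version(name)`; that module is not available on the Python 3.5.2 interpreter the code was tested with.
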

src/__init__.py

Whitespace-only changes.

src/classifiers/__init__.py

Whitespace-only changes.
