- Download the Wikipedia pt-br articles dump:
curl https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles-multistream-index.txt.bz2 --create-dirs -o data/ptwiki-latest-pages-articles-multistream-index.txt.bz2
curl https://dumps.wikimedia.org/ptwiki/latest/ptwiki-latest-pages-articles-multistream.xml.bz2 --create-dirs -o data/ptwiki-latest-pages-articles-multistream.xml.bz2
- The Wikipedia dump contains every article in the wiki text format, a markdown-like markup with special tokens. To get only the text, we need to convert that wiki markup into raw text. We could use Python's gensim.corpora.WikiCorpus, but its tokenizer is not well suited to Portuguese.
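For reference, the gensim route would look roughly like this; a minimal sketch of the alternative I did not use, with the dump path taken from the download step above:

```python
from gensim.corpora import WikiCorpus

# Stream articles straight from the compressed dump.
# Passing dictionary={} skips building a vocabulary, which keeps this fast.
wiki = WikiCorpus(
    'data/ptwiki-latest-pages-articles-multistream.xml.bz2',
    dictionary={},
)

# get_texts() yields one list of tokens per article, already stripped of markup,
# but this default tokenization is what makes it a poor fit for Portuguese.
for i, tokens in enumerate(wiki.get_texts()):
    print(tokens[:10])
    if i == 2:
        break
```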
So I ended up using wikiextractor instead and cleaning up the text myself with a separate script. Clone and run wikiextractor to transform the XML data into text:
git clone https://github.com/attardi/wikiextractor.git
cd ./wikiextractor
python ./WikiExtractor.py --no-templates -o ../data/ptwiki-articles-text/ -b 10M -c ../data/ptwiki-latest-pages-articles-multistream.xml.bz2
cd ..
This process generates multiple compressed files (10MB each) of wiki article text in the following format:
<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
Sono stati costruiti anche alcuni harmonium con due manuali.
...
</doc>
At the time of writing there were 1,000,400 documents in the ptwiki dump. =]
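If you want to sanity-check that count, a small sketch like the one below works; it assumes wikiextractor wrote bz2-compressed shards under data/ptwiki-articles-text/ (the exact shard names, e.g. AA/wiki_00.bz2, depend on the wikiextractor version):

```python
import bz2
import glob

# Count <doc ...> headers across every compressed shard produced by wikiextractor.
total_docs = 0
for path in glob.glob('data/ptwiki-articles-text/**/*.bz2', recursive=True):
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            if line.startswith('<doc '):
                total_docs += 1

print(f'{total_docs} documents extracted')
```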
- Now that we have the Wikipedia texts, we can start pre-processing the files.
python scripts/preprocess.py ./data/ptwiki-articles-text/ -o ./data/ptwiki-articles-text-cleaned
This script does the following (a rough sketch of these rules is shown after the list):
- Splits the text into sentences using nltk.data.load('tokenizers/punkt/portuguese.pickle').
- Does not change the case (later I'll use a POS parser that is more accurate when the original casing is kept).
- Removes sentences with fewer than 4 words.
- Keeps abbreviations, like 'Dr.'.
- Keeps hyphenated words, like 'guarda-chuva'.
- Maps all emails to an EMAIL token.
- Maps all numbers to a 0 token.
- Maps all URLs to a URL token.
- Standardizes the different quote characters.
- Standardizes the different hyphen characters.
- Removes HTML strings.
- Removes all text between brackets.
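The actual implementation is scripts/preprocess.py; the sketch below only illustrates those rules, and the regex patterns are my own approximations rather than the ones the script actually uses:

```python
import re
import nltk

# Portuguese Punkt model for sentence splitting (requires nltk.download('punkt') once).
sent_tokenizer = nltk.data.load('tokenizers/punkt/portuguese.pickle')

EMAIL_RE = re.compile(r'\S+@\S+\.\S+')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
NUMBER_RE = re.compile(r'\d+([.,]\d+)*')
BRACKETS_RE = re.compile(r'\([^)]*\)|\[[^\]]*\]')
HTML_RE = re.compile(r'<[^>]+>')

def clean_text(text):
    text = HTML_RE.sub(' ', text)       # remove HTML strings
    text = BRACKETS_RE.sub(' ', text)   # remove text between brackets
    text = URL_RE.sub('URL', text)      # map URLs to a URL token
    text = EMAIL_RE.sub('EMAIL', text)  # map emails to an EMAIL token
    text = NUMBER_RE.sub('0', text)     # map numbers to a 0 token
    # Standardize quotes and hyphens.
    text = text.replace('“', '"').replace('”', '"').replace('‘', "'").replace('’', "'")
    text = text.replace('–', '-').replace('—', '-')
    return text

def preprocess(text, min_words=4):
    sentences = []
    for sent in sent_tokenizer.tokenize(clean_text(text)):
        # Keep the original casing; only drop very short sentences.
        if len(sent.split()) >= min_words:
            sentences.append(sent)
    return sentences

print(preprocess('O Dr. Silva (nascido em 1970) usa um guarda-chuva vermelho todos os dias.'))
```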
These steps generate a pt-BR corpus with:
- 1.6GB of text
- 9,896,520 sentences
- 251,193,592 tokens
- 3,137,040 unique tokens
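Those numbers are straightforward to reproduce; a minimal sketch, assuming the cleaned corpus is stored as plain-text files with one sentence per line under data/ptwiki-articles-text-cleaned/:

```python
from pathlib import Path

sentences = tokens = 0
vocab = set()

# Walk every file in the cleaned corpus and count sentences, tokens and unique tokens.
for path in Path('data/ptwiki-articles-text-cleaned').rglob('*'):
    if not path.is_file():
        continue
    with path.open(encoding='utf-8') as f:
        for line in f:
            words = line.split()
            if not words:
                continue
            sentences += 1
            tokens += len(words)
            vocab.update(words)

print(f'{sentences:,} sentences, {tokens:,} tokens, {len(vocab):,} unique tokens')
```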