Commit b7148cc

committed Dec 18, 2024
Finish & publish WordNet post
1 parent 02508ad commit b7148cc

File tree

3 files changed, +30 -9 lines changed

_drafts/2024-12-15-sqlite-word-dictionary-from-wordnet.md → _posts/2024-12-18-sqlite-word-dictionary-from-wordnet.md

+30 -9
@@ -15,9 +15,9 @@ All code in this article is available in a [WordNetToSQLite repo](https://github
 
 For an upcoming word game app, I needed a dictionary of words. I wanted to know the type of word (noun / verb / adjective / adverb), and a definition for each. Plus, it should only include sensible words (e.g. no proper nouns, acronyms, or profanity).
 
-I decided to prefill a **SQLite database** and ship it with my app, since I can easily update it by just shipping a new database, or even remotely with SQL. Android also has good support for retrieving the data from SQLite.
+I decided to prefill a **SQLite database** and ship it with my app, since I can easily update it by just shipping a new database (or even remotely with SQL!). Android also has good support for retrieving the data from SQLite.
 
-However, finding a suitable list of words was tricky! I found plenty of sources containing just words, or with no information on source or licensing. Eventually, I discovered [Princeton University's WordNet](https://wordnet.princeton.edu/) exists, and is free to use! There's also a [more up to date fork](https://github.com/globalwordnet/english-wordnet) (2024 instead of 2006).
+However, finding a suitable list of words was tricky! I found plenty of sources containing just words, or with no information on source or licensing. Eventually, I discovered [Princeton University's WordNet](https://wordnet.princeton.edu/) exists, and luckily it's free to use and has a very liberal license. There's also a [more up to date fork](https://github.com/globalwordnet/english-wordnet) (2024 instead of 2006).
 
 However, it contains a lot of unneeded information and complexity, and is 33MB+ uncompressed. Time to get filtering…

@@ -28,21 +28,21 @@ If you wish to recreate [`words.db`](https://github.com/JakeSteam/WordNetToSQLit
 1. Obtain a WordNet format database.
    - I used [a regularly updated fork](https://github.com/globalwordnet/english-wordnet) (2024 edition, WNDB format)
    - You can also use the original WordNet files from 2006 (`WNdb-3.0.tar.gz` from [WordNet](https://wordnet.princeton.edu/download/current-version))
-2. Extract it, and place the `data.x` files in `/wordnet-data/`.
+2. Extract your download, and place the `data.x` files in `/wordnet-data/`.
 3. Run `py wordnet-to-sqlite.py`.
 4. In a minute, you'll have a word database!
 
 Out of the box, the script takes ~60 seconds to run. This slightly slow speed is an intentional trade-off in exchange for having full control over the language filter (see [profanity removal](#profanity-removal)).
 
 ## Notes on results
 
-The database contains over 73k word & word type combinations, each with a definition. I use the open source [DB Browser for SQLite](https://sqlitebrowser.org/) to browse the results, looking something like this:
+The database contains over 71k word & word type combinations, each with a definition. I use the open source [DB Browser for SQLite](https://sqlitebrowser.org/) to browse the results, looking something like this:
 
 [![](/assets//images/2024/sqlite-browser.png)](/assets//images/2024/sqlite-browser.png)
 
 ### Schema definition
 
-Only one definition per word for the same `type` are combined (e.g. with the noun `article`, but not the verb):
+Only one definition per word for the same `type` is used (e.g. with the noun `article`, but not the verb):
 
 - `word`:
   - Any words with uppercase letters (e.g. proper nouns) are removed.
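Based on that schema description, a sketch of how an app might query the finished database (the `words` table and column names here are my assumption from the description above, not necessarily the repo's exact schema, and the definitions are illustrative):

```python
import sqlite3

# Hypothetical schema matching the described columns: word, type, definition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, type TEXT, definition TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [
        ("article", "noun", "a piece of nonfictional prose in a publication"),
        ("article", "verb", "bind by a contract, e.g. for a training period"),
    ],
)

# An app can then fetch every sense of a word in a single query.
rows = conn.execute(
    "SELECT type, definition FROM words WHERE word = ?", ("article",)
).fetchall()
```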
@@ -61,7 +61,7 @@ Only one definition per word for the same `type` are combined (e.g. with the nou
 
 ## Notes on code
 
-Whilst [`wordnet-to-sqlite.py`](https://github.com/JakeSteam/WordNetToSQLite/blob/main/wordnet-to-sqlite.py) is under 100 lines of not-very-good Python, I'll briefly walk through what it does.
+Whilst [`wordnet-to-sqlite.py`](https://github.com/JakeSteam/WordNetToSQLite/blob/main/wordnet-to-sqlite.py) is under 100 lines of not-very-good Python and doesn't do anything _too_ crazy, I'll briefly walk through how it works.
 
 ### Raw data
@@ -81,11 +81,24 @@ Further notes on WordNet's data files [are here](https://wordnet.princeton.edu/d
 4. If the word is valid, add it to the dictionary so long as it isn't already defined for the current word type. For example, a word might be used as a noun _and_ an adjective.
 5. Finally, output all these word, type, and definition rows into a SQLite database we prepared earlier.
 
+Luckily, as Python is a very readable language, function definitions almost read like sentences:
+
+```python
+def is_valid_word(word, definition):
+    return (
+        word.islower() and
+        word.isalpha() and
+        len(word) > 1 and
+        not is_roman_numeral(word, definition) and
+        not is_profanity(word)
+    )
+```
+
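For anyone wanting to experiment with `is_valid_word` outside the repo, here's a self-contained sketch; note that `is_roman_numeral` and `is_profanity` below are simplified stand-ins I've invented for illustration, not the repo's real implementations:

```python
import re

# Stand-in helper: treat a word as a Roman numeral if it uses only
# numeral letters and its definition describes a number (simplified).
def is_roman_numeral(word, definition):
    return bool(re.fullmatch(r"[ivxlcdm]+", word)) and "number" in definition

# Stand-in helper: the real filter checks a full wordlist, not a tiny set.
PROFANE_WORDS = {"darn"}

def is_profanity(word):
    return word in PROFANE_WORDS

def is_valid_word(word, definition):
    return (
        word.islower() and
        word.isalpha() and
        len(word) > 1 and
        not is_roman_numeral(word, definition) and
        not is_profanity(word)
    )

print(is_valid_word("apple", "fruit of the apple tree"))  # lowercase, alphabetic: valid
print(is_valid_word("Paris", "capital of France"))        # uppercase: proper noun, rejected
print(is_valid_word("xiv", "the cardinal number 14"))     # Roman numeral, rejected
```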
 ### Profanity removal
 
 Since this dictionary is for a child-friendly game, profane words should be removed if possible. Players are spelling the words themselves, so I don't need to filter _too_ aggressively, but slurs should never be possible.
 
-The eventual solution is in [`/profanity/`](https://github.com/JakeSteam/WordNetToSQLite/tree/main/profanity), where `wordlist.json` is the words to remove, `whitelisted.txt` is the words I've manually removed from the wordlist, and `log.txt` is every removed word & definition.
+The eventual solution is in [`/profanity/`](https://github.com/JakeSteam/WordNetToSQLite/tree/main/profanity), where `wordlist.json` is the words to remove, `manually-removed.txt` & `manually-added.txt` are the words I've manually removed from / added to the wordlist, and `log.txt` is every removed word & definition.
 
 #### Choice of package

@@ -105,6 +118,12 @@ Whilst I now had a good word list, at this point I gave up using libraries, and
 
 I implemented a solution that just checks every word (& word of definition) against a combined regex of every profane word. Yes, this is a bit slow and naive, but it finally gives correct results!
 
+```python
+with open(wordlist_path, 'r', encoding='utf-8') as f:
+    profane_words = set(json.load(f))
+combined_profanity_regex = re.compile(r'\b(?:' + '|'.join(re.escape(word) for word in profane_words) + r')\b', re.IGNORECASE)
+```
+
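To see that combined regex in action, here's a self-contained sketch, with a tiny placeholder wordlist inlined instead of loading `wordlist.json` ("darn" and "heck" stand in for actual profanity):

```python
import re

# Placeholder wordlist; the real script loads these from /profanity/wordlist.json.
profane_words = {"darn", "heck"}

# re.escape guards against regex metacharacters in the wordlist, and the
# \b boundaries mean only whole-word matches count.
combined_profanity_regex = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in profane_words) + r')\b',
    re.IGNORECASE,
)

def contains_profanity(text):
    # A word (or its definition) is rejected on any whole-word match.
    return bool(combined_profanity_regex.search(text))

print(contains_profanity("what the heck"))    # whole-word match
print(contains_profanity("a checkered flag")) # "heck" inside "checkered" is not a whole word
```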
 The script takes about a minute to parse the 161,705 word candidates, pull out 71,361 acceptable words, and store them in the database. Fast enough for a rarely run task.
 
 ### Optimisation
@@ -120,9 +139,11 @@ A few steps are taken to improve performance:
 
 The approach taken to generate the database had quite a lot of trial and error. Multiple times I thought I was "done", then I'd check the database or raw data and discover I was incorrectly including or excluding data!
 
-I'll absolutely tweak this script a bit as I go forward and my requirements change, but it's good enough for a starting point. Specifically, next steps are probably:
+[SQLite Browser](https://sqlitebrowser.org/) was extremely useful during this process, as the near-instant filtering helped me check profane words weren't slipping through. It also helped me catch a few times when technical data would leak into the definitions.
+
+I'll absolutely tweak this script a bit as I go forward (I've implemented all my initial ideas since starting the article!) and my requirements change, but it's good enough for a starting point. Specifically, next steps are probably:
 
 - ~~Try the [WordNet 3.1](https://wordnet.princeton.edu/download/current-version) database instead of 3.0, and see if there's any noticeable differences (there's no release notes!)~~ Tried, not much change
 - ~~Use [an open source fork](https://github.com/globalwordnet/english-wordnet), since it has yearly updates so should be higher quality than WordNet's 2006 data.~~ Done!
 - ~~Replace the current profanity library, since it takes far longer than the rest of the process, and pointlessly checks letter replacements (e.g. `h3ll0`) despite knowing my words are all lowercase letters.~~ Done!
-- Use the word + type combo as a composite primary key on the database, and ensure querying it is as efficient as possible.
+- ~~Use the word + type combo as a composite primary key on the database, and ensure querying it is as efficient as possible.~~ Done! Increased database size by ~20%, so will see if it's necessary.
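The composite-key idea in that last item might look something like the sketch below (my illustration of the technique, not the repo's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Composite primary key over (word, type): the same word may appear once per
# word type, and lookups by word can use the implicit primary-key index.
conn.execute(
    """
    CREATE TABLE words (
        word TEXT NOT NULL,
        type TEXT NOT NULL,
        definition TEXT NOT NULL,
        PRIMARY KEY (word, type)
    )
    """
)
conn.execute("INSERT INTO words VALUES ('article', 'noun', 'a piece of writing')")
conn.execute("INSERT INTO words VALUES ('article', 'verb', 'to bind by contract')")

# A duplicate word + type combination is rejected by the key constraint.
try:
    conn.execute("INSERT INTO words VALUES ('article', 'noun', 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The extra index behind the primary key is the likely source of the ~20% size increase the author mentions.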

assets/images/2024/sqlite-browser.png (12.3 KB → 5.48 KB)
