For an upcoming word game app, I needed a dictionary of words. I wanted to know the type of word (noun / verb / adjective / adverb), and a definition for each. Plus, it should only include sensible words (e.g. no proper nouns, acronyms, or profanity).
I decided to prefill a **SQLite database** and ship it with my app, since I can easily update it by just shipping a new database (or even remotely with SQL!). Android also has good support for retrieving the data from SQLite.
However, finding a suitable list of words was tricky! I found plenty of sources containing just words, or with no information on source or licensing. Eventually, I discovered [Princeton University's WordNet](https://wordnet.princeton.edu/) exists, and luckily it's free to use and has a very liberal license. There's also a [more up-to-date fork](https://github.com/globalwordnet/english-wordnet) (2024 instead of 2006).
However, it contains a lot of unneeded information and complexity, and is 33MB+ uncompressed. Time to get filtering…
1. Obtain a WordNet format database.
   - I used [a regularly updated fork](https://github.com/globalwordnet/english-wordnet) (2024 edition, WNDB format)
   - You can also use the original WordNet files from 2006 (`WNdb-3.0.tar.gz` from [WordNet](https://wordnet.princeton.edu/download/current-version))
2. Extract your download, and place the `data.x` files in `/wordnet-data/`.
3. Run `py wordnet-to-sqlite.py`.
4. In a minute, you'll have a word database!
Out of the box, the script takes ~60 seconds to run. This relatively slow speed is an intentional trade-off for having full control over the language filter (see [profanity removal](#profanity-removal)).
## Notes on results
The database contains over 71k word & word type combinations, each with a definition. I use the open source [DB Browser for SQLite](https://sqlitebrowser.org/) to browse the results, looking something like this:
Only one definition per word for the same `type` is used (e.g. with the noun `article`, but not the verb):
- `word`:
  - Any words with uppercase letters (e.g. proper nouns) are removed.
## Notes on code
Whilst [`wordnet-to-sqlite.py`](https://github.com/JakeSteam/WordNetToSQLite/blob/main/wordnet-to-sqlite.py) is under 100 lines of not-very-good Python and doesn't do anything _too_ crazy, I'll briefly walk through how it works.
### Raw data
4. If the word is valid, add it to the dictionary so long as it isn't already defined for the current word type. For example, a word might be used as a noun _and_ an adjective.
5. Finally, output all these word, type, and definition rows into a SQLite database we prepared earlier.
Luckily, as Python is a very readable language, function definitions almost read like sentences:
```python
def is_valid_word(word, definition):
    return (
        word.islower() and
        word.isalpha() and
        len(word) > 1 and
        not is_roman_numeral(word, definition) and
        not is_profanity(word)
    )
```
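The deduplication in step 4 can then be sketched with a plain dictionary keyed on the word + type combination. This is a simplified sketch of the idea, not the repo's exact code, and `add_word` is an illustrative helper of my own:

```python
# Sketch of step 4 (not the repo's exact code): keep only the first
# definition seen for each word + type combination.
words = {}  # (word, word_type) -> definition

def add_word(words, word, word_type, definition):
    # setdefault only stores the definition if this combo hasn't been seen yet
    words.setdefault((word, word_type), definition)

add_word(words, "article", "noun", "one of a class of artifacts")
add_word(words, "article", "noun", "a separate section of a legal document")  # ignored: noun already stored
add_word(words, "article", "verb", "bind by a contract")  # kept: different word type
```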
### Profanity removal
Since this dictionary is for a child-friendly game, profane words should be removed if possible. Players are spelling the words themselves, so I don't need to filter _too_ aggressively, but slurs should never be possible.
The eventual solution is in [`/profanity/`](https://github.com/JakeSteam/WordNetToSQLite/tree/main/profanity), where `wordlist.json` is the words to remove, `manually-removed.txt` & `manually-added.txt` are the words I've manually removed from / added to the wordlist, and `log.txt` is every removed word & definition.
#### Choice of package
I implemented a solution that just checks every word (& every word in its definition) against a combined regex of every profane word. Yes, this is a bit slow and naive, but it finally gives correct results!
```python
with open(wordlist_path, 'r', encoding='utf-8') as f:
    profane_words = set(json.load(f))

combined_profanity_regex = re.compile(r'\b(?:' + '|'.join(re.escape(word) for word in profane_words) + r')\b', re.IGNORECASE)
```
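Applying that regex is then a single `search` per string. Here's a hedged sketch of the check, using a made-up two-word list rather than the real `wordlist.json` (`contains_profanity` is my illustrative name, not necessarily the repo's function):

```python
import re

# Stand-in wordlist for the demo, not the real wordlist.json
profane_words = {"darn", "heck"}
combined_profanity_regex = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in profane_words) + r')\b',
    re.IGNORECASE,
)

def contains_profanity(text):
    # \b word boundaries mean "darn" matches as a whole word,
    # but not inside a longer word like "darning"
    return combined_profanity_regex.search(text) is not None

clean = not contains_profanity("article") and not contains_profanity("a piece of writing")
flagged = contains_profanity("well DARN it")  # case-insensitive match
```

The word boundaries are why this approach avoids the classic "Scunthorpe problem" of substring-based filters.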
The script takes about a minute to parse the 161,705 word candidates, pull out 71,361 acceptable words, and store them in the database. Fast enough for a rarely run task.
### Optimisation
The approach taken to generate the database had quite a lot of trial and error. Multiple times I thought I was "done", then I'd check the database or raw data and discover I was incorrectly including or excluding data!
[SQLite Browser](https://sqlitebrowser.org/) was extremely useful during this process, as the near-instant filtering helped me check profane words weren't slipping through. It also helped me catch a few times when technical data would leak into the definitions.
I'll absolutely tweak this script a bit as I go forward (I've implemented all my initial ideas since starting the article!) and my requirements change, but it's good enough for a starting point. Specifically, next steps are probably:
- ~~Try the [WordNet 3.1](https://wordnet.princeton.edu/download/current-version) database instead of 3.0, and see if there's any noticeable differences (there's no release notes!)~~ Tried, not much change
- ~~Use [an open source fork](https://github.com/globalwordnet/english-wordnet), since it has yearly updates so should be higher quality than WordNet's 2006 data.~~ Done!
- ~~Replace the current profanity library, since it takes far longer than the rest of the process, and pointlessly checks letter replacements (e.g. `h3ll0`) despite knowing my words are all lowercase letters.~~ Done!
- ~~Use the word + type combo as a composite primary key on the database, and ensure querying it is as efficient as possible.~~ Done! Increased database size by ~20%, so will see if it's necessary.
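That last item is a one-line schema change. A minimal sketch of a composite primary key in SQLite (the `words` table and column names are assumed from the description above, not copied from the repo):

```python
import sqlite3

# Sketch: composite primary key across word + type (column names are assumed)
conn = sqlite3.connect(":memory:")  # the real database is the shipped words.db file
conn.execute(
    """CREATE TABLE words (
        word TEXT NOT NULL,
        type TEXT NOT NULL,
        definition TEXT NOT NULL,
        PRIMARY KEY (word, type)
    )"""
)
conn.execute("INSERT INTO words VALUES ('article', 'noun', 'a piece of writing')")
conn.execute("INSERT INTO words VALUES ('article', 'verb', 'bind by a contract')")  # same word, different type: allowed

# A duplicate word + type combination is rejected by the key
try:
    conn.execute("INSERT INTO words VALUES ('article', 'noun', 'duplicate definition')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
row_count = conn.execute("SELECT COUNT(*) FROM words").fetchone()[0]
```

The key also backs `word` + `type` lookups with an index, which is presumably where the ~20% size increase comes from.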