Background
I've been integrating a dictionary lookup tool into a recent project to help people learn words from games. The following are my goals.
Goals
These goals define user input and the desired output.
1. Get the longest possible entry for the user's input
Input: 自己紹介(じこしょうかい)
Desired Output: 自己紹介, not 自己 or 自
2. Context-unaware
Input: 牛 in 牛丼を食べてる
Desired Output: 牛(うし), but not ぎゅう as in ぎゅうどん(牛丼)
3. Parse conjugated verbs and adjectives
Input: 食べて in 牛丼を食べてる
Desired Output: 食べる(たべる)
4. The dictionary used by the parser is independent from the dictionary used for the entry's glossary.
(More later.)
Our desired output assumes a certain degree of Japanese grammatical knowledge on the user's part. When the user selects 折 from 折角, we assume they want to know what 折 means rather than 折角. When they select 外国人, we assume they want to look up the whole compound noun rather than its individual characters.
Setting up
Dictionary - JMDict
We will be using JMDict, a freely available Japanese-to-English dictionary. Japanese-to-other-language dictionaries can be found through the Yomichan project.
from pathlib import Path
import zipfile
import json
SCRIPT_DIR = Path(__file__).parent
dictionary_map = {}
def load_dictionary(dictionary):
    output_map = {}
    archive = zipfile.ZipFile(dictionary, 'r')
    result = []
    for file in archive.namelist():
        if file.startswith('term'):
            with archive.open(file) as f:
                data = f.read()
                d = json.loads(data.decode("utf-8"))
                result.extend(d)
    for entry in result:
        if entry[0] in output_map:
            output_map[entry[0]].append(entry)
        else:
            output_map[entry[0]] = [entry]  # using the headword as the key for finding the dictionary entry
    return output_map
def setup():
    global dictionary_map
    dictionary_map = load_dictionary(str(Path(SCRIPT_DIR, 'dictionaries', 'jmdict_english.zip')))
To load our dictionary, we unzip the file and save its entries into a map keyed by headword. Entries that share a headword are appended to the same list, so a single key can hold several senses of the same word.
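The grouping step can be sketched with a couple of hand-written entries. The field layout below follows the term-bank convention the code indexes into (headword at index 0, glossary list at index 5, sequence number at index 6); the sample values are illustrative, not real JMDict data:

```python
# Two toy entries share the headword 食べる; field layout mirrors the
# term-bank format (index 0 = headword, 5 = glossary list, 6 = sequence).
entries = [
    ['食べる', 'たべる', 'v1 vt', '', 0, ['to eat'], 1358280, ''],
    ['食べる', 'たべる', 'v1 vt', '', 0, ['to live on (e.g. a salary)'], 1358280, ''],
    ['牛', 'うし', 'n', '', 0, ['cattle', 'cow'], 0, ''],
]

output_map = {}
for entry in entries:
    # Entries sharing a headword are collected under one key,
    # so a single lookup returns every sense.
    output_map.setdefault(entry[0], []).append(entry)

print(len(output_map['食べる']))  # 2 senses under one headword
print(output_map['牛'][0][5])     # ['cattle', 'cow']
```

This is the same behavior as the `if/else` branch in `load_dictionary`, just written with `setdefault`.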
Parser - Sudachi
pip install sudachipy
pip install sudachidict_small
To save space, we will use Sudachi's small dictionary instead of its core dictionary (70 MB).
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary(dict_type='small').create()
mode = tokenizer.Tokenizer.SplitMode.A
There are three split modes in Sudachi - A, B, and C. Mode A splits text into the shortest possible units, while mode C keeps the longest. Since exact matches against our own dictionary are checked first anyway, mode A serves here as the fallback for recovering the base form of whatever the user selected.
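The difference between preferring longer and shorter matches can be illustrated with a toy greedy segmenter over a hand-written lexicon. This is only a conceptual sketch - Sudachi uses its own dictionary and cost model, not this algorithm:

```python
# Toy lexicon; real segmentation relies on Sudachi's dictionary and costs.
LEXICON = {'牛', '丼', '牛丼', '食べ', 'を'}

def segment(text, longest=True):
    """Greedy left-to-right segmentation, preferring the longest or
    shortest lexicon match at each position (unknown chars pass through)."""
    out, i = [], 0
    while i < len(text):
        lengths = range(len(text) - i, 0, -1) if longest else range(1, len(text) - i + 1)
        for n in lengths:
            if text[i:i + n] in LEXICON:
                out.append(text[i:i + n])
                i += n
                break
        else:
            out.append(text[i])  # not in the lexicon: emit the single character
            i += 1
    return out

print(segment('牛丼を食べ', longest=True))   # ['牛丼', 'を', '食べ']
print(segment('牛丼を食べ', longest=False))  # ['牛', '丼', 'を', '食べ']
```

The longest-match pass keeps 牛丼 whole, while the shortest-match pass breaks it into 牛 and 丼 - the same trade-off the A/B/C modes expose.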
Putting it together
def look_up(word):
    word = word.strip()
    if word not in dictionary_map:
        m = tokenizer_obj.tokenize(word, mode)[0]
        word = m.dictionary_form()
    if word not in dictionary_map:
        return None
    result = [{
        'headword': entry[0],
        'reading': entry[1],
        'tags': entry[2],
        'glossary_list': entry[5],
        'sequence': entry[6]
    } for entry in dictionary_map[word]]
    return result
We first strip any unnecessary white space around the word, then check directly whether it exists in our dictionary. This way nouns like 牛丼 are found immediately, without being parsed.
If that fails, we parse the word with Sudachi mode A, take its dictionary_form(), and look that up in our own dictionary instead of relying on the parser's dictionary.
The final result is reformatted and returned.
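The whole flow can be exercised without Sudachi by stubbing out the dictionary-form step. The stub tokenizer and toy entries below are placeholders for illustration, not real parser output or real JMDict data:

```python
# Toy dictionary keyed by headword, the same shape load_dictionary builds.
dictionary_map = {
    '牛丼': [['牛丼', 'ぎゅうどん', 'n', '', 0, ['rice covered with beef'], 1845250, '']],
    '食べる': [['食べる', 'たべる', 'v1 vt', '', 0, ['to eat'], 1358280, '']],
}

# Stand-in for tokenizer_obj.tokenize(word, mode)[0].dictionary_form().
def dictionary_form(word):
    return {'食べて': '食べる'}.get(word, word)

def look_up(word):
    word = word.strip()
    if word not in dictionary_map:      # exact match first
        word = dictionary_form(word)    # fall back to the base form
    if word not in dictionary_map:
        return None
    return [{'headword': e[0], 'reading': e[1], 'tags': e[2],
             'glossary_list': e[5], 'sequence': e[6]}
            for e in dictionary_map[word]]

print(look_up('牛丼')[0]['reading'])     # ぎゅうどん (exact match, no parsing)
print(look_up('食べて')[0]['headword'])  # 食べる (resolved via the base form)
print(look_up('unknown'))                # None
```

Swapping the stub for the real Sudachi call gives the production version above; everything else stays the same.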
(env) $ python
>>> setup()
>>> print(look_up('牛丼'))
[{'headword': '牛丼', 'reading': 'ぎゅうどん', 'tags': 'n', 'glossary_list': ['rice covered with beef and vegetables'], 'sequence': 1845250}]
>>> print(look_up('食べて'))
[{'headword': '食べる', 'reading': 'たべる', 'tags': 'v1 vt', 'glossary_list': ['to eat'], 'sequence': 1358280}, {'headword': '食べる', 'reading': 'たべる', 'tags': 'v1 vt', 'glossary_list': ['to live on (e.g. a salary)', 'to live off', 'to subsist on'], 'sequence': 1358280}]
>>> print(look_up('自己紹介'))
[{'headword': '自己紹介', 'reading': 'じこしょうかい', 'tags': 'n vs', 'glossary_list': ['self-introduction'], 'sequence': 1317650}]
Let me know if this was helpful.