Background
I've been integrating a dictionary lookup tool into a recent project to help people learn words from games. The following are my goals.
Goals
These goals define user input and the desired output.
1. Get the longest possible entry for the user's input
Input: 自己紹介(じこしょうかい)
Desired Output: 自己紹介, not 自己 or 自
2. Context-unaware
Input: 牛 in 牛丼を食べてる
Desired Output: 牛(うし), but not ぎゅう as in ぎゅうどん(牛丼)
3. Parse conjugated verbs and adjectives
Input: 食べて in 牛丼を食べてる
Desired Output: 食べる(たべる)
4. The dictionary used by the parser is independent from the dictionary used for the entry's glossary.
(More later.)
Our desired output assumes a certain degree of Japanese grammatical knowledge on the user's part. When the user selects 折 from 折角, we assume they want to know what 折 means rather than 折角. When they select 外国人, we assume they want to look up the whole compound noun rather than its individual characters.
Setting up
Dictionary - JMDict
We will be using JMDict, a freely available Japanese-to-English dictionary. Japanese-to-other-language dictionaries can be found through the Yomichan project.
from pathlib import Path
import zipfile
import json
SCRIPT_DIR = Path(__file__).parent
dictionary_map = {}
def load_dictionary(dictionary):
    output_map = {}
    archive = zipfile.ZipFile(dictionary, 'r')
    result = []
    for file in archive.namelist():
        if file.startswith('term'):
            with archive.open(file) as f:
                data = f.read()
                d = json.loads(data.decode("utf-8"))
                result.extend(d)
    for entry in result:
        if entry[0] in output_map:
            output_map[entry[0]].append(entry)
        else:
            output_map[entry[0]] = [entry]  # using the headword as the key for finding the dictionary entry
    return output_map
def setup():
    global dictionary_map
    dictionary_map = load_dictionary(str(Path(SCRIPT_DIR, 'dictionaries', 'jmdict_english.zip')))
To load our dictionary, we unzip the file and save its entries into a map keyed by headword. Entries that share a headword are appended to the same list, so a single key can hold several senses of the same word.
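The grouping step can be sketched with a couple of hand-written entries. The field layout below follows the term-bank convention the code indexes into (headword at index 0, glossary list at index 5, sequence number at index 6); the sample values are illustrative, not real JMDict data:

```python
# Two toy entries share the headword 食べる; field layout mirrors the
# term-bank format (index 0 = headword, 5 = glossary list, 6 = sequence).
entries = [
    ['食べる', 'たべる', 'v1 vt', '', 0, ['to eat'], 1358280, ''],
    ['食べる', 'たべる', 'v1 vt', '', 0, ['to live on (e.g. a salary)'], 1358280, ''],
    ['牛', 'うし', 'n', '', 0, ['cattle', 'cow'], 0, ''],
]

output_map = {}
for entry in entries:
    # Entries sharing a headword are collected under one key,
    # so a single lookup returns every sense.
    output_map.setdefault(entry[0], []).append(entry)

print(len(output_map['食べる']))  # 2 senses under one headword
print(output_map['牛'][0][5])     # ['cattle', 'cow']
```

This is the same behavior as the `if/else` branch in `load_dictionary`, just written with `setdefault`.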
Parser - Sudachi
pip install sudachipy
pip install sudachidict_small
To save space, we will use Sudachi's small dictionary instead of its core dictionary (70 MB).
from sudachipy import tokenizer
from sudachipy import dictionary
tokenizer_obj = dictionary.Dictionary(dict_type='small').create()
mode = tokenizer.Tokenizer.SplitMode.A
There are three split modes in Sudachi - A, B, and C. Mode A splits text into the shortest possible units, while mode C keeps the longest. Since exact matches against our own dictionary are checked first anyway, mode A serves here as the fallback for recovering the base form of whatever the user selected.
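The difference between preferring longer and shorter matches can be illustrated with a toy greedy segmenter over a hand-written lexicon. This is only a conceptual sketch - Sudachi uses its own dictionary and cost model, not this algorithm:

```python
# Toy lexicon; real segmentation relies on Sudachi's dictionary and costs.
LEXICON = {'牛', '丼', '牛丼', '食べ', 'を'}

def segment(text, longest=True):
    """Greedy left-to-right segmentation, preferring the longest or
    shortest lexicon match at each position (unknown chars pass through)."""
    out, i = [], 0
    while i < len(text):
        lengths = range(len(text) - i, 0, -1) if longest else range(1, len(text) - i + 1)
        for n in lengths:
            if text[i:i + n] in LEXICON:
                out.append(text[i:i + n])
                i += n
                break
        else:
            out.append(text[i])  # not in the lexicon: emit the single character
            i += 1
    return out

print(segment('牛丼を食べ', longest=True))   # ['牛丼', 'を', '食べ']
print(segment('牛丼を食べ', longest=False))  # ['牛', '丼', 'を', '食べ']
```

The longest-match pass keeps 牛丼 whole, while the shortest-match pass breaks it into 牛 and 丼 - the same trade-off the A/B/C modes expose.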
Putting it together
def look_up(word):
    word = word.strip()
    if word not in dictionary_map:
        m = tokenizer_obj.tokenize(word, mode)[0]
        word = m.dictionary_form()
    if word not in dictionary_map:
        return None
    result = [{
        'headword': entry[0],
        'reading': entry[1],
        'tags': entry[2],
        'glossary_list': entry[5],
        'sequence': entry[6]
    } for entry in dictionary_map[word]]
    return result
We first strip any unnecessary white space around the word, then check directly whether it exists in our dictionary. This way nouns like 牛丼 are found immediately, without being parsed.
If that fails, we parse the word with Sudachi mode A, take its dictionary_form(), and look that up in our own dictionary instead of relying on the parser's dictionary.
The final result is reformatted and returned.
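The whole flow can be exercised without Sudachi by stubbing out the dictionary-form step. The stub tokenizer and toy entries below are placeholders for illustration, not real parser output or real JMDict data:

```python
# Toy dictionary keyed by headword, the same shape load_dictionary builds.
dictionary_map = {
    '牛丼': [['牛丼', 'ぎゅうどん', 'n', '', 0, ['rice covered with beef'], 1845250, '']],
    '食べる': [['食べる', 'たべる', 'v1 vt', '', 0, ['to eat'], 1358280, '']],
}

# Stand-in for tokenizer_obj.tokenize(word, mode)[0].dictionary_form().
def dictionary_form(word):
    return {'食べて': '食べる'}.get(word, word)

def look_up(word):
    word = word.strip()
    if word not in dictionary_map:      # exact match first
        word = dictionary_form(word)    # fall back to the base form
    if word not in dictionary_map:
        return None
    return [{'headword': e[0], 'reading': e[1], 'tags': e[2],
             'glossary_list': e[5], 'sequence': e[6]}
            for e in dictionary_map[word]]

print(look_up('牛丼')[0]['reading'])     # ぎゅうどん (exact match, no parsing)
print(look_up('食べて')[0]['headword'])  # 食べる (resolved via the base form)
print(look_up('unknown'))                # None
```

Swapping the stub for the real Sudachi call gives the production version above; everything else stays the same.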
(env) $ python
>>> setup()
>>> print(look_up('牛丼'))
[{'headword': '牛丼', 'reading': 'ぎゅうどん', 'tags': 'n', 'glossary_list': ['rice covered with beef and vegetables'], 'sequence': 1845250}]
>>> print(look_up('食べて'))
[{'headword': '食べる', 'reading': 'たべる', 'tags': 'v1 vt', 'glossary_list': ['to eat'], 'sequence': 1358280}, {'headword': '食べる', 'reading': 'たべる', 'tags': 'v1 vt', 'glossary_list': ['to live on (e.g. a salary)', 'to live off', 'to subsist on'], 'sequence': 1358280}]
>>> print(look_up('自己紹介'))
[{'headword': '自己紹介', 'reading': 'じこしょうかい', 'tags': 'n vs', 'glossary_list': ['self-introduction'], 'sequence': 1317650}]
Let me know if this was helpful.