Starred by 11 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Jan 2013
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug




Add Korean spell checking support

Project Member Reported by js...@chromium.org, Mar 16 2009

Issue description

The Korean spellcheck dictionary for Hunspell has been developed at 
http://code.google.com/p/spellcheck-ko

There's a firefox extension for Korean spellchecking using this dictionary.     

http://forums.mozilla.or.kr/viewtopic.php?p=38615  (sorry it's in Korean)

( http://www.box.net/shared/p6ddocas8m )

The developer of firefox extension also made an extension for openoffice. 
http://www.box.net/shared/670ugnuqip

Both use the same dictionary at code.google.com

Being highly agglutinative, Korean is likely to 'suffer' from the same 
problem (huge memory use) as Hungarian given the way Hunspell stores words 
in memory (expanding all the possible combinations of 'stem + suffixes' and 
storing them all in a hash table). 


 
Good to see my dictionary project in Chromium project.

Note that it requires the ICONV and OCONV features introduced in hunspell 1.2.8 for Hangul
Unicode normalization. But AFAIK Chromium's hunspell version is a bit out of date. I think
Chromium could implement its own Hangul normalization routine instead.

Comment 2 by js...@chromium.org, Jul 1 2009

Blockedon: 14756

Comment 3 by mhm@chromium.org, Aug 1 2009

Comment 4 by mhm@chromium.org, Aug 1 2009

Hi all, I tried converting that Korean aff/dic:

M:\code\chromium\src\chrome\Debug>convert_dict.exe ko_KO
Reading ko_KO.aff ...
Reading ko_KO.dic ...
Serializing...
Verifying...
ERROR converting, the dictionary does not check out OK.


So ...

Comment 5 by js...@chromium.org, Aug 5 2009

To Changwoo: 

Does the normalization for Korean in Hunspell go beyond the Unicode 
normalization? 
How about  UTR 47 draft ( http://www.unicode.org/draft/reports/tr47/tr47.html )? 
Well, I can just take a look at the dictionary file to see what's there. 

mhm just upgraded Chrome's Hunspell to 1.2.8 and ICONV/OCONV should work in 
theory, but it seems that the bdic file generation failed. 


No, just NFC/NFD conversions. In the source aff.py:


import unicodedata
...
def NFD(unistr):
    return unicodedata.normalize('NFD', unistr)
...
_conv_strings = []
_conv_strings.append('ICONV 11172')
for uch in map(unichr, range(0xac00, 0xd7a3 + 1)):
    _conv_strings.append('ICONV %s %s' % (uch, NFD(uch)))
_conv_strings.append('OCONV 11172')
for uch in map(unichr, range(0xac00, 0xd7a3 + 1)):
    _conv_strings.append('OCONV %s %s' % (NFD(uch), uch))
CONV_DEFINES = '\n'.join(_conv_strings)


Comment 7 by js...@chromium.org, Aug 5 2009

Blockedon: -14756
Labels: Mstone-4
Thank you for the reply and starting the project (Korean hunspell dictionary) !

When calculating the edit distance, it might work better to go beyond NFD. 
Whether and how far to go probably depends on a few factors, including IME/keyboard 
layouts (and what's proposed in UTR 47 is not necessarily the best for spell checking). 
Well, I'm digressing here. 

Getting back to this bug:  Due to a lot of valgrind warnings, hunspell was reverted 
back to 1.1.5. So, we can't support Korean. 

Actually, we can (as Changwoo suggested) because we can easily do NFD/NFC ourselves 
before/after calling hunspell functions. Even if we upgrade to 1.2.8, we'd better 
remove the ICONV/OCONV entries (11,172 each) from the Korean dictionary and do NFD/NFC 
ourselves to reduce memory consumption and improve performance (ICU's NFD/NFC 
would be much faster than what's done inside Hunspell with 11,172 ICONV/OCONV 
entries). So, I'm removing 'blocked on 14756'. 
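
A minimal Python sketch of the approach described above, normalizing outside the spell checker instead of relying on ICONV/OCONV tables. This is not Chromium's code (which would use ICU from C++); `make_korean_checker`, `to_display`, and the set-based stand-in dictionary are hypothetical names used only for illustration.

```python
import unicodedata

def make_korean_checker(check_nfd_word):
    """Wrap a checker that expects NFD input so callers can pass NFC text.

    check_nfd_word is a hypothetical callable standing in for the real
    spell-check call; it receives the word already decomposed to NFD.
    """
    def check(word):
        return check_nfd_word(unicodedata.normalize('NFD', word))
    return check

def to_display(suggestions):
    """Recompose NFD suggestions to NFC before showing them to the user."""
    return [unicodedata.normalize('NFC', s) for s in suggestions]

# Stand-in dictionary that stores its words in NFD, like the Korean .dic file.
nfd_words = {unicodedata.normalize('NFD', '한글')}
check = make_korean_checker(lambda w: w in nfd_words)
```

With this wrapper, `check('한글')` looks up the decomposed form, and any suggestions coming back in NFD can be passed through `to_display` before being shown.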

Nonetheless, it's too late for 3.0.  

And, another potential blocker is that we don't have a word breaker for Korean. Our 
copy of ICU has a 'rudimentary' implementation for Korean word breaking, but it's 
disabled because we don't have an open-sourceable Korean word frequency list, yet. 
That's another reason I was excited to find the Korean hunspell dictionary. 




Comment 8 by mhm@chromium.org, Sep 1 2009

Now that the new hunspell has landed, do you have any idea why the output of our 
converter tool still fails verification? Does it have something to do with NFD/NFC?

m0@m0-desktop:~/chrome/src/sconsbuild/Debug$ ./convert_dict ko_KO
Reading ko_KO.aff ...
Reading ko_KO.dic ...
Serializing...
Verifying...
ERROR converting, the dictionary does not check out OK.

Comment 9 by hbono@chromium.org, Sep 1 2009

mhm,

Thank you for your work and sorry for my slow updates.
If I recall correctly, this Korean dictionary contained a very long word (>128 
characters) that prevented our dictionary converter from converting it when I tested 
the dictionary this March, i.e. five months before now.
Let me check it with the latest one.

Regards,

Comment 10 by mhm@chromium.org, Sep 1 2009

Ah! Okay, I have added some logging to convert_dict and it seems it's failing:
"Word doesn't match, line #28992"

Which is WordList #28992:
날조할뻔하다/3

I have no idea what that means :) Help from a Korean speaker would be appreciated.
mhm,

Unfortunately, convert_dict sorts the words in a ".dic" file while creating a WordList object, as I noted on IRC. So 
this number 28992 is not the line number of the entry that caused the conversion problem.

Looking for the word that does cause this conversion problem, it seems line #172706 of this "ko.dic" has a really long 
word:

"김수한무거북이와두루미삼천갑자동방삭치치카포사리사리센타워리워리세브리캉무드셀라구름위허리케인에담벼락서생원에고양이고양이는바둑이바둑이는돌돌이/2".

After removing this line from the dic file, we can convert it without problems.

jshin,

Even though I'm not a native Korean speaker, it doesn't seem to be one word but a phrase. Could you give me 
your opinion on whether we can remove this line?

Regards,

Comment 12 by mhm@chromium.org, Sep 1 2009

Forget about my comment above. As hbono pointed out on IRC, the word list that we use 
is alphabetized, so the line numbers don't match. He has some results that he will share 
soon.
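
To illustrate why an index into the alphabetized word list is not a ".dic" line number, here is a small Python sketch that keeps the original line number attached to each entry while sorting. It is only an illustration: `sorted_index_to_source_line` is a hypothetical helper, the sort is plain code point order (convert_dict's actual ordering is not shown in this thread), and the leading word-count line of a real ".dic" file is ignored.

```python
def sorted_index_to_source_line(dic_lines, sorted_index):
    """Map a position in the sorted word list back to a 1-based source line.

    dic_lines: the entries in file order (hypothetical input, without the
    count header that a real .dic file starts with).
    """
    # Pair each entry with its original line number before sorting,
    # so the number survives the reordering.
    numbered = sorted(
        (word, lineno) for lineno, word in enumerate(dic_lines, start=1)
    )
    return numbered[sorted_index]

# Three made-up entries in non-alphabetical file order.
example_lines = ['다/1', '가/2', '나/3']
first_word, first_lineno = sorted_index_to_source_line(example_lines, 0)
```

Here the first entry of the sorted list comes from line 2 of the input, so logging only the sorted index would point at the wrong line.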
hbono,

The long word "김수한무..." is an Easter egg, based on an old Korean comedy TV show.
It can be safely removed.

Comment 14 by mhm@chromium.org, Sep 5 2009

Status: Started
Actually, the size limit of BDict is not 128 characters but 128 bytes in UTF-8. When 
translated to syllable counts, it's ~ 43. Moreover, the current Korean dictionary 
uses NFD, which makes it even shorter (like ~14 or ~21). Anyway, if there's only one 
word longer than that, it's ok.  
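
The arithmetic behind these estimates can be checked with a short Python sketch. The 128-byte figure is taken from the comment above; `BDICT_MAX_BYTES`, `utf8_len`, and `fits_in_bdict` are hypothetical names, and whether the real limit is inclusive is an assumption.

```python
import unicodedata

BDICT_MAX_BYTES = 128  # limit quoted above; exact boundary handling is assumed

def utf8_len(word, form):
    """UTF-8 byte length of word after normalizing to form ('NFC' or 'NFD')."""
    return len(unicodedata.normalize(form, word).encode('utf-8'))

def fits_in_bdict(word, form):
    """True if the normalized word stays within the assumed byte limit."""
    return utf8_len(word, form) <= BDICT_MAX_BYTES

ga = '가'   # U+AC00: decomposes into 2 jamo (no final consonant)
gak = '각'  # U+AC01: decomposes into 3 jamo (with final consonant)

# Each precomposed syllable and each conjoining jamo takes 3 bytes in UTF-8.
nfc_bytes = utf8_len(ga, 'NFC')          # 3 bytes per syllable
nfd_bytes_short = utf8_len(ga, 'NFD')    # 6 bytes per syllable
nfd_bytes_long = utf8_len(gak, 'NFD')    # 9 bytes per syllable

max_syllables_nfc = BDICT_MAX_BYTES // nfc_bytes
max_syllables_nfd_best = BDICT_MAX_BYTES // nfd_bytes_short
max_syllables_nfd_worst = BDICT_MAX_BYTES // nfd_bytes_long
```

Integer division gives 42 syllables in NFC, and 21 or 14 syllables in NFD depending on whether every syllable has a final consonant, which matches the rough figures above.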

BTW, fixing this is not as simple as converting the Korean dictionary to the Bdict 
format and checking that in. We also want to remove entries for NFC -> NFD conversion 
because we can do that easily/cheaply. To do that, we have to change our code 
slightly. 

Moreover, without Korean segmentation activated, it'll not be very useful. 


Comment 16 by mhm@chromium.org, Sep 11 2009

Status: Available
Labels: -Mstone-4 Mstone-X
Labels: -I18N bulkmove Feature-I18N
Labels: -Area-Misc -Mstone-X Area-Internals Feature-Spellcheck
Owner: odean@chromium.org

Comment 21 by groby@chromium.org, Aug 31 2012

Cc: groby@chromium.org
Cc: -hbono@chromium.org
Labels: -Pri-2 Pri-3
Owner: ----
Owner: rouslan@chromium.org
Status: Assigned
Status: Started
Summary: Add Korean spell checking support (was: add Korean spell checking support)
Groby: Can you please review and dcommit ko.patch and ko-1-2.bdic to src/third_party/hunspell_dictionaries? I do not have full committer powers. 

I gzipped the files because of the 10MB file limit for attachments, btw.
Attachments:
ko.patch.gz (1.2 MB)
ko-1-2.bdic.gz (603 KB)
Project Member

Comment 26 by bugdroid1@chromium.org, Dec 15 2012

The following revision refers to this bug:
    http://src.chromium.org/viewvc/chrome?view=rev&revision=173254

------------------------------------------------------------------------
r173254 | rlp@chromium.org | 2012-12-15T01:38:29.429306Z

Changed paths:
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/id-ID-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/en-US-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sl-SI-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/en-GB-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sk-SK-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_ru_RU.txt?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ru_RU.aff?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/fr_FR.aff?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_fr_FR.txt?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/lt-LT-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sv-SE-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/vi-VN-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_nl_NL.txt?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/lv-LV-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/nl_NL.aff?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/pt-PT-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ru_RU.dic_delta?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/el-GR-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/fo-FO-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/nb-NO-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ko-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sl_SI.dic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/cs-CZ-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sv_SE.dic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/lv_LV.aff?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_lv_LV.txt?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sq-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/en-CA-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/hu-HU-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/it-IT-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ca-ES-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ru_RU.dic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/he-IL-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/af-ZA-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ro-RO-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/fr_FR.dic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/da_DK.dic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/nl_NL.dic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/hi-IN-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/en-AU-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_ca_ES.txt?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ta-IN-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ru-RU-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sh-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/fr-FR-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/es-ES-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_sv_SE.txt?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sv_SE.aff?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/bg-BG-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/de-DE-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/hr-HR-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/da-DK-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/nl-NL-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/de_DE_neu.dic_delta?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README.chromium?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/uk-UA-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   M http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/lv_LV.dic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/pl-PL-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/pt-BR-3-0.bdic?r1=173254&r2=173253&pathrev=173254
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sr-3-0.bdic?r1=173254&r2=173253&pathrev=173254

[Spellcheck] Updating dictionaries to most recent versions. See the respective bugs for each language for more detail.

BUG= 8397 , 8803 , 20083 , 65116 ,  104891 , 112227 , 113821 
------------------------------------------------------------------------
Project Member

Comment 27 by bugdroid1@chromium.org, Dec 15 2012

The following revision refers to this bug:
    http://src.chromium.org/viewvc/chrome?view=rev&revision=173256

------------------------------------------------------------------------
r173256 | rlp@chromium.org | 2012-12-15T01:56:04.173883Z

Changed paths:
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_ko.txt?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ko.aff?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_ta_IN.txt?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ta_IN.aff?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ko.dic?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ta_IN.dic?r1=173256&r2=173255&pathrev=173256
   D http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/ca_ES?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sq.dic_delta?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/README_sq.txt?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sq.aff?r1=173256&r2=173255&pathrev=173256
   D http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sv_SE?r1=173256&r2=173255&pathrev=173256
   A http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/sq.dic?r1=173256&r2=173255&pathrev=173256

[Spellcheck] Updating some of the dictionaries missed in the last CL. Deleting some old files.

BUG= 8397 , 8803 , 20083 , 65116 ,  104891 , 112227 , 113821 

Paths modified but not in any changelist:
------------------------------------------------------------------------
Status: Fixed
Dictionary should appear in Chrome after version 26.0.1377.0.
Project Member

Comment 30 by bugdroid1@chromium.org, Mar 10 2013

Labels: -Area-Internals -Feature-Spellcheck -Feature-I18N Cr-Internals Cr-UI-I18N Cr-UI-Browser-Spellcheck
Project Member

Comment 31 by bugdroid1@chromium.org, Mar 20 2013

Labels: -Cr-UI-I18N Cr-UI-Internationalization
Components: -UI>Browser>Spellcheck UI>Browser>Language>Spellcheck
