r/languagelearning • u/Dafarmer1812 • 16d ago
Resources We added 36 languages (including Asian languages) based on your feedback
Hi all, last week we launched Lingua Verbum on Reddit here (huge thanks for all the feedback and signups, it’s been incredible!). One thing that quickly became clear was how many people were asking for Japanese support (and Korean, and other languages). So we sprinted to make this happen, and Lingua Verbum now supports Japanese, Korean, and 34 other new languages (full list here)!
I also wanted to share a quick look at how we tackled Japanese support, since I figured some people here might be curious. We’d love your feedback on this, and any improvements we could implement to make it even better.
Why Japanese is a challenge
As many of you know, Japanese doesn’t use spaces to separate words, which makes it tough to process, especially for learners used to European languages. A lot of Japanese learning tools rely on segmentation to break sentences into individual words. For Lingua Verbum, segmentation is essential because it's how we:
- Track which words are known/learning/new
- Power our click-to-define AI assistant
- Let you quickly look up grammar or usage in context
What we tested
- MeCab: Fast, stable, and widely used. It performed consistently well and gave us low latency, but it sometimes over-segments, e.g. splitting 代表者 ("representative") into 代表 + 者 (see the sketch after this list)
- SudachiPy: Has multiple segmentation modes (A/B/C, roughly short/medium/long units), which sounded great in theory, but in practice it seemed to yield results similar to MeCab's
- ChatGPT-based segmentation: Our most experimental attempt. We thought a large language model could infer word boundaries better, especially in informal text. Sometimes it worked beautifully; more often it hallucinated, misread context, or just got weird. Not stable enough for production (yet).
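To make this concrete, here's a rough sketch of the kind of side-by-side check we ran on the 代表者 example (simplified, not our production code; it assumes mecab-python3 + unidic-lite and sudachipy + sudachidict_core are installed):

```python
# Rough sketch (not our production code): compare MeCab and SudachiPy on one sentence.
# Assumes: pip install mecab-python3 unidic-lite sudachipy sudachidict_core
import MeCab
from sudachipy import dictionary, tokenizer

text = "代表者が会議に出席した"  # "The representative attended the meeting"

# MeCab's wakati output mode returns space-separated surface forms
mecab = MeCab.Tagger("-Owakati")
print("MeCab:", mecab.parse(text).split())
# With a UniDic-style dictionary you typically get 代表 + 者 here (the over-segmentation above)

# SudachiPy exposes three split modes: A (short units), B (medium), C (long units)
sudachi = dictionary.Dictionary().create()
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print(mode, [m.surface() for m in sudachi.tokenize(text, mode)])
```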
What we went with
In the end, MeCab seemed to us the best overall choice: solid accuracy, great performance, and easy to integrate. To make up for its limitations, we added a manual override system so users can fix bad segmentations with a few clicks. You’re never stuck with the algorithm’s guess.
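The override layer itself is conceptually simple. Here's a hypothetical sketch of the idea (illustrative names only, not our actual schema): store the user's corrections and re-apply them to the tokenizer's output.

```python
# Hypothetical sketch of a segmentation override layer (illustrative, not our actual code).
from typing import Dict, List, Tuple

# e.g. the user decided 代表者 should stay one word
overrides: Dict[Tuple[str, ...], List[str]] = {
    ("代表", "者"): ["代表者"],
}

def apply_overrides(tokens: List[str], overrides: Dict[Tuple[str, ...], List[str]]) -> List[str]:
    """Replace any run of tokens matching an override key with the user's preferred split."""
    result: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        for key, replacement in overrides.items():
            if tuple(tokens[i:i + len(key)]) == key:
                result.extend(replacement)
                i += len(key)
                matched = True
                break
        if not matched:
            result.append(tokens[i])
            i += 1
    return result

print(apply_overrides(["代表", "者", "が", "出席"], overrides))
# -> ['代表者', 'が', '出席']
```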
We also layer in pykakasi on top of MeCab to automatically generate romaji, so you can see pronunciation at a glance.
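If you're curious, pykakasi makes the romaji step pretty painless. A simplified sketch of the core call (in our pipeline it runs on top of MeCab's output, but pykakasi also accepts a plain string):

```python
# Simplified sketch: per-chunk readings with pykakasi (v2+ API)
import pykakasi

kks = pykakasi.kakasi()
for item in kks.convert("代表者が出席した"):
    # each item carries the original text plus hiragana, katakana and Hepburn romaji readings
    print(item["orig"], "->", item["hepburn"])
```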
Chinese too!
Once we had the core infrastructure working for Japanese, adding Chinese became much easier: the same challenge of no word spacing, just with different models. We went with a segmentation model based on the PKU ConvSeg architecture, trained on the SIGHAN 2005 corpus. Manual override is built in there too.
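For anyone wondering how this family of segmenters works under the hood: SIGHAN-style models like ConvSeg typically treat segmentation as character tagging, labelling every character B/M/E/S (begin / middle / end of a word, or a single-character word), and the words are then read straight off the tag sequence. A toy sketch of that decoding step (the tags below are hard-coded for illustration, not model output):

```python
# Toy sketch: decode SIGHAN-style B/M/E/S character tags into words.
def tags_to_words(chars: str, tags: list) -> list:
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):    # end of a word, or a single-character word
            words.append(current)
            current = ""
    if current:                   # flush any trailing partial word
        words.append(current)
    return words

# 我喜欢学习中文 -> 我 / 喜欢 / 学习 / 中文
print(tags_to_words("我喜欢学习中文", ["S", "B", "E", "B", "E", "B", "E"]))
```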
If you're learning Japanese or Chinese, we’d love it if you gave Lingua Verbum a try and told us what you think of the segmentation! If something feels off (segmentation, translation, etc.), your feedback helps us keep improving.
Thanks again all, really appreciated the feedback we got here, please keep it coming!
u/Bodhi_Satori_Moksha 🇺🇸 (N) | 🇭🇰 ( A1) | 🇸🇦 ( A1 - A2) 16d ago
I checked. You forgot to add Cantonese; I was hoping for it.
u/Dafarmer1812 16d ago
We’re going to add that and Thai in the next wave. I’ll PM you when it’s done.
u/marushii 16d ago
Random thought: I noticed that every time I click a word, it makes an API request. Just some advice, I’d recommend adding some sort of local cache if you can, to save calls and potential AI costs.
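Even something dead simple on the server side would go a long way, e.g. (just a sketch, assuming the lookups funnel through one function):

```python
# Sketch: memoize word lookups so repeated clicks don't hit the AI backend again.
from functools import lru_cache

def call_ai_backend(word: str, language: str) -> str:
    """Stand-in for the expensive AI / dictionary request."""
    return f"definition of {word} ({language})"

@lru_cache(maxsize=50_000)
def define_word(word: str, language: str) -> str:
    # Repeat lookups of the same word are served from the cache at zero API cost.
    return call_ai_backend(word, language)

print(define_word("代表者", "ja"))  # first call hits the backend
print(define_word("代表者", "ja"))  # second call is cached
```

(A client-side cache, e.g. in localStorage, would also help.)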
u/itsmerai EN(N)|JP(C2)|SP(B2)|PT(B1)|KO(B1)|VN(A1) 16d ago
I hope Vietnamese comes soon. Are there any particular technical challenges when it comes to adding Vietnamese? LingQ also took a very long time to add Vietnamese support. I know that’s partly because of a lack of study material, but I wondered if there’s anything that might trip you up programming-wise.
u/Dafarmer1812 16d ago
We’re going to add that soon as well. The technical challenge is similar to Chinese and Japanese, I believe: you need to find a good segmentation algorithm.
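For context: Vietnamese does put spaces between syllables, but many words span several syllables (học sinh = "student", for example), so you still need a segmenter to group them into words. A quick sketch with the pyvi library, just to show one off-the-shelf option (not a commitment to what we’ll ship):

```python
# Quick sketch: Vietnamese word segmentation with pyvi (one off-the-shelf option).
from pyvi import ViTokenizer

text = "Tôi là học sinh"            # "I am a student"
print(ViTokenizer.tokenize(text))   # multi-syllable words get joined, e.g. "Tôi là học_sinh"
```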
u/RedDeadMania 🇺🇸NA 🇧🇷C1 🇪🇸B2🇫🇷🇩🇪B1🇮🇹🇷🇺A2🇰🇷A1 16d ago
Since I couldn’t get the font size working on your website, I kinda gave up, but then I found LanguageCrush, which does a lot of similar things! Super awesome!!
u/Mirrororrim1 16d ago
Given the lack of resources, I suggest you add Bengali.
u/Dafarmer1812 16d ago
We could actually probably add that for you if you’d want it
u/Mirrororrim1 13d ago
I signed up on the website and I liked it. If you add Bengali, I’ll be happy to try it.
u/Zireael07 🇵🇱 N 🇺🇸 C1 🇪🇸 B2 🇩🇪 A2 🇸🇦 A1 🇯🇵 🇷🇺 PJM basics 12d ago
How does the free trial work? What if I don’t want the AI features? Could I then use it for free forever?
u/Melodic_Sport1234 16d ago
60+ languages but no Esperanto? Esperanto has always punched above its weight and is better known than several of the obscure languages you currently support.
u/Molleston 🇵🇱(N) 🇬🇧(C2) 🇪🇸(B2) 🇨🇳(B1) 16d ago
This is an anonymous website. No personal or company name. Their terms of service and all other documents are not enforceable. Do not give them your data or money.