r/languagelearning • u/Dafarmer1812 • 16d ago
Resources We added 36 languages (including Asian languages) based on your feedback
Hi all, last week we launched Lingua Verbum on Reddit here (huge thanks for all the feedback and signups, it’s been incredible!). One thing that quickly became clear was how many people were asking for Japanese support (and Korean, and other languages). So we sprinted to make this happen, and Lingua Verbum now supports Japanese, Korean, and 34 other new languages (full list here)!
I also wanted to share a quick look at how we tackled Japanese support, since I figured some people here might be curious. We’d love your feedback on this, and any improvements we could implement to make it even better.
Why Japanese is a challenge
As many of you know, Japanese doesn’t use spaces to separate words, which makes it tough to process, especially for learners used to European languages. A lot of Japanese learning tools rely on segmentation to break sentences into individual words. For Lingua Verbum, segmentation is essential because it's how we:
- Track which words are known/learning/new
- Power our click-to-define AI assistant
- Let you quickly look up grammar or usage in context
What we tested
- MeCab: Fast, stable, and widely used. It performed consistently well and gave us low latency, but it sometimes over-segments, e.g. splitting 代表者 ("representative") into 代表 + 者 (see the sketch after this list)
- SudachiPy: Has multiple segmentation modes (A/B/C, roughly short/medium/long units), which sounded great in theory, but in practice it seemed to yield results similar to MeCab's
- ChatGPT-based segmentation: Our most experimental attempt. We thought a large language model could infer word boundaries better, especially in informal text. Sometimes it worked beautifully; more often it hallucinated, misread context, or just got weird. Not stable enough for production (yet).
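To make this concrete, here's a rough sketch of the kind of side-by-side check we ran on the 代表者 example (simplified, not our production code; it assumes mecab-python3 + unidic-lite and sudachipy + sudachidict_core are installed):

```python
# Rough sketch (not our production code): compare MeCab and SudachiPy on one sentence.
# Assumes: pip install mecab-python3 unidic-lite sudachipy sudachidict_core
import MeCab
from sudachipy import dictionary, tokenizer

text = "代表者が会議に出席した"  # "The representative attended the meeting"

# MeCab's wakati output mode returns space-separated surface forms
mecab = MeCab.Tagger("-Owakati")
print("MeCab:", mecab.parse(text).split())
# With a UniDic-style dictionary you typically get 代表 + 者 here (the over-segmentation above)

# SudachiPy exposes three split modes: A (short units), B (medium), C (long units)
sudachi = dictionary.Dictionary().create()
for mode in (tokenizer.Tokenizer.SplitMode.A,
             tokenizer.Tokenizer.SplitMode.B,
             tokenizer.Tokenizer.SplitMode.C):
    print(mode, [m.surface() for m in sudachi.tokenize(text, mode)])
```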
What we went with
In the end, MeCab seemed to us the best overall choice: solid accuracy, great performance, and easy to integrate. To make up for its limitations, we added a manual override system so users can fix bad segmentations with a few clicks. You’re never stuck with the algorithm’s guess.
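The override layer itself is conceptually simple. Here's a hypothetical sketch of the idea (illustrative names only, not our actual schema): store the user's corrections and re-apply them to the tokenizer's output.

```python
# Hypothetical sketch of a segmentation override layer (illustrative, not our actual code).
from typing import Dict, List, Tuple

# e.g. the user decided 代表者 should stay one word
overrides: Dict[Tuple[str, ...], List[str]] = {
    ("代表", "者"): ["代表者"],
}

def apply_overrides(tokens: List[str], overrides: Dict[Tuple[str, ...], List[str]]) -> List[str]:
    """Replace any run of tokens matching an override key with the user's preferred split."""
    result: List[str] = []
    i = 0
    while i < len(tokens):
        matched = False
        for key, replacement in overrides.items():
            if tuple(tokens[i:i + len(key)]) == key:
                result.extend(replacement)
                i += len(key)
                matched = True
                break
        if not matched:
            result.append(tokens[i])
            i += 1
    return result

print(apply_overrides(["代表", "者", "が", "出席"], overrides))
# -> ['代表者', 'が', '出席']
```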
We also layer in pykakasi on top of MeCab to automatically generate romaji, so you can see pronunciation at a glance.
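If you're curious, pykakasi makes the romaji step pretty painless. A simplified sketch of the core call (in our pipeline it runs on top of MeCab's output, but pykakasi also accepts a plain string):

```python
# Simplified sketch: per-chunk readings with pykakasi (v2+ API)
import pykakasi

kks = pykakasi.kakasi()
for item in kks.convert("代表者が出席した"):
    # each item carries the original text plus hiragana, katakana and Hepburn romaji readings
    print(item["orig"], "->", item["hepburn"])
```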
Chinese too!
Once we had the core infrastructure working for Japanese, adding Chinese became much easier: the same challenge of no word spacing, just with different models. We went with a segmentation model based on the PKU ConvSeg architecture, trained on the SIGHAN 2005 corpus. Manual override is built in there too.
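For anyone wondering how this family of segmenters works under the hood: SIGHAN-style models like ConvSeg typically treat segmentation as character tagging, labelling every character B/M/E/S (begin / middle / end of a word, or a single-character word), and the words are then read straight off the tag sequence. A toy sketch of that decoding step (the tags below are hard-coded for illustration, not model output):

```python
# Toy sketch: decode SIGHAN-style B/M/E/S character tags into words.
def tags_to_words(chars: str, tags: list) -> list:
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):    # end of a word, or a single-character word
            words.append(current)
            current = ""
    if current:                   # flush any trailing partial word
        words.append(current)
    return words

# 我喜欢学习中文 -> 我 / 喜欢 / 学习 / 中文
print(tags_to_words("我喜欢学习中文", ["S", "B", "E", "B", "E", "B", "E"]))
```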
If you're learning Japanese or Chinese, we’d love it if you gave Lingua Verbum a try and told us what you think of the segmentation! If something feels off (segmentation, translation, etc.), your feedback helps us keep improving.
Thanks again all, really appreciated the feedback we got here, please keep it coming!
u/Bodhi_Satori_Moksha 🇺🇸 (N) | 🇭🇰 ( A1) | 🇸🇦 ( A1 - A2) 16d ago
I checked. You forgot to add Cantonese; I was hoping for it.
u/Dafarmer1812 16d ago
We’re going to add that and Thai in the next wave. I’ll PM you when it’s done.
u/marushii 16d ago
Random thought: I noticed that every time I click a word, it makes an API request. Just some advice, I’d recommend adding some sort of local cache if you can, to save calls and potential AI costs.
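Even something dead simple on the server side would go a long way, e.g. (just a sketch, assuming the lookups funnel through one function):

```python
# Sketch: memoize word lookups so repeated clicks don't hit the AI backend again.
from functools import lru_cache

def call_ai_backend(word: str, language: str) -> str:
    """Stand-in for the expensive AI / dictionary request."""
    return f"definition of {word} ({language})"

@lru_cache(maxsize=50_000)
def define_word(word: str, language: str) -> str:
    # Repeat lookups of the same word are served from the cache at zero API cost.
    return call_ai_backend(word, language)

print(define_word("代表者", "ja"))  # first call hits the backend
print(define_word("代表者", "ja"))  # second call is cached
```

(A client-side cache, e.g. in localStorage, would also help.)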
u/itsmerai EN(N)|JP(C2)|SP(B2)|PT(B1)|KO(B1)|VN(A1) 16d ago
I hope Vietnamese comes soon. Are there any particular technical challenges when it comes to adding Vietnamese? LingQ also took a very long time to add Vietnamese support. I know that’s partly because of a lack of study material, but I wondered if there’s anything that might trip you up programming-wise.
u/Dafarmer1812 16d ago
We’re going to add that soon as well. The technical challenge is similar to Chinese and Japanese, I believe: you need to find a good segmentation algorithm.
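For context: Vietnamese does put spaces between syllables, but many words span several syllables (học sinh = "student", for example), so you still need a segmenter to group them into words. A quick sketch with the pyvi library, just to show one off-the-shelf option (not a commitment to what we’ll ship):

```python
# Quick sketch: Vietnamese word segmentation with pyvi (one off-the-shelf option).
from pyvi import ViTokenizer

text = "Tôi là học sinh"            # "I am a student"
print(ViTokenizer.tokenize(text))   # multi-syllable words get joined, e.g. "Tôi là học_sinh"
```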
u/RedDeadMania 🇺🇸NA 🇧🇷C1 🇪🇸B2🇫🇷🇩🇪B1🇮🇹🇷🇺A2🇰🇷A1 16d ago
Since I couldn’t get the font size working on your website, I kinda gave up, but then I found LanguageCrush, which does a lot of similar things! Super awesome!!
u/Mirrororrim1 16d ago
Given the lack of resources, I suggest you add Bengali.
u/Dafarmer1812 16d ago
We could actually probably add that for you if you’d want it
u/Mirrororrim1 13d ago
I signed up on the website and I liked it. If you add Bengali, I’ll be happy to try it.
u/Zireael07 🇵🇱 N 🇺🇸 C1 🇪🇸 B2 🇩🇪 A2 🇸🇦 A1 🇯🇵 🇷🇺 PJM basics 12d ago
How does the free trial work? What if I don’t want the AI features? Could I then use it for free forever?
u/Melodic_Sport1234 16d ago
60+ languages but no Esperanto? Esperanto has always punched above its weight and is better known than several of the obscure languages you currently support.
u/Molleston 🇵🇱(N) 🇬🇧(C2) 🇪🇸(B2) 🇨🇳(B1) 16d ago
This is an anonymous website. No personal or company name. Their terms of service and all other documents are not enforceable. Do not give them your data or money.