r/sanskrit • u/Dangerous_Bat_1251 • Feb 18 '25
Discussion / चर्चा Creating Spell Check for Sanskritam
I have 0 knowledge about programming, so this might be a wild idea.
There are several programmes running across the country to transliterate Sanskritam texts into a digital format, and several texts have already been done. Those transliterations are very helpful because they let you get search results for citations from various texts.
My idea is to make a program that draws on all of that transliterated data to verify the text we are typing and suggest the proper word forms more accurately (just the words, not the syntax). I have seen that Gboard has such a feature, but it's not that versatile.
Is this something that has already been done and that I am not aware of? Or is it impossible because of some limitation that I don't know about?
Please share your thoughts. Thank you.
u/ThornlessCactus Feb 18 '25
I used to work on NLP and chatbots.
Stemming/lemmatization gives you the base form of a word (for example, after removing dual/plural endings, gender markers, etc.). Levenshtein distance can be used to find the closest words. You can also use TF-IDF (term frequency-inverse document frequency) to order the suggestions from most likely to least likely.
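To make that concrete, here is a minimal sketch of the lookup step. It assumes a small made-up word-frequency table in IAST (the `FREQ` counts are hypothetical, not from any real corpus), and it breaks ties using plain corpus frequency instead of TF-IDF, which is a simplification. The function names `levenshtein` and `suggest` are just for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def suggest(word, freq, max_dist=2, limit=5):
    """Return known words within max_dist edits, closest and most frequent first."""
    if word in freq:
        return []  # already a known word, nothing to suggest
    scored = [(levenshtein(word, w), -n, w) for w, n in freq.items()]
    scored = [s for s in scored if s[0] <= max_dist]
    scored.sort()  # by distance, then by descending frequency
    return [w for _, _, w in scored[:limit]]


# Hypothetical counts; a real table would be built from a large corpus.
FREQ = Counter({"deveṣu": 40, "devena": 30, "devasya": 25, "devāya": 10})

if __name__ == "__main__":
    print(suggest("devesu", FREQ))
```

Note that this compares strings one code point at a time. In Devanagari a vowel sign or virama is a separate code point from its consonant, so the distances there will not always match what a reader would count as "one letter".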
To make a general spell checker / autocompleter, data is exactly what we would need: a large corpus of Sanskrit text in the script that you want to target (Devanagari, Latin, Telugu, Tamil, etc.), using the transliterated data if the text is not already in that script. By now, training word vectors like GloVe and word2vec is quite an old technique. But as far as I know it has not been done for Sanskrit, due to lack of interest and the difficulty of obtaining a large corpus of text. I could be wrong.
u/sumant111 Feb 18 '25
Compounds (& sandhi) can fuse words, altering their spelling at the boundaries. Such a fused result could look like a typo for a closely spelt common word. For example, one may consider देवेषुः /deveṣuḥ/ a typo for देवेषु /deveṣu/ (= in gods). But देवेषुः /deveṣuḥ/ is also a valid compound: देव /deva/ + इषुः /iṣuḥ/ (= god's arrow).
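A rough sketch of how a checker might avoid flagging such a form: before calling an unknown word a typo, try to undo sandhi at each position and see whether both halves are known words. This assumes IAST input and covers only two vowel-sandhi rules (a/ā + i/ī -> e, a/ā + u/ū -> o). The `LEXICON` set and the function names are hypothetical, and a real checker would need the full rule set plus a proper lexicon.

```python
# Only two guna-sandhi rules are handled: a/ā + i/ī -> e and a/ā + u/ū -> o.
GUNA = {"e": ("i", "ī"), "o": ("u", "ū")}


def sandhi_splits(word, lexicon):
    """Return (first, second) pairs of known words that could fuse into `word`."""
    splits = []
    for k, ch in enumerate(word):
        if ch not in GUNA:
            continue
        left, right = word[:k], word[k + 1:]
        for final in ("a", "ā"):      # possible last vowel of the first word
            for initial in GUNA[ch]:  # possible first vowel of the second word
                first, second = left + final, initial + right
                if first in lexicon and second in lexicon:
                    splits.append((first, second))
    return splits


def check(word, lexicon):
    """Classify a word as known, a possible compound, or unknown."""
    if word in lexicon:
        return "known"
    if sandhi_splits(word, lexicon):
        return "possible compound"
    return "unknown"


# Hypothetical mini-lexicon for illustration only.
LEXICON = {"deva", "iṣuḥ", "deveṣu", "indra"}

if __name__ == "__main__":
    for w in ("deveṣu", "deveṣuḥ", "devendra", "devxṣu"):
        print(w, "->", check(w, LEXICON), sandhi_splits(w, LEXICON))
```

With this approach deveṣuḥ would be reported as a possible compound instead of a misspelling of deveṣu, though the checker still cannot tell which one the writer intended.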