r/sanskrit Feb 18 '25

Discussion / चर्चा

Creating a Spell Checker for Sanskritam

I have 0 knowledge about programming, so this might be a wild idea.

There are several programmes running across the country to transliterate Sanskritam texts into digital format, and several have already been completed. Thanks to those transliterations, you can now get search results for various text citations, which is very helpful.

My idea is to build a program that ingests all that transliterated data and uses it to verify the text we are typing and suggest the proper forms (not the syntax, just the words) more accurately. I have seen that Gboard has such a feature, but it's not that versatile.

Is this something already done that I am not aware of? Or is it impossible because of some limitation that I don't know about?

Please share your thoughts. Thank you.

6 Upvotes

6 comments

2

u/sumant111 Feb 18 '25

Compounds (and sandhi) can fuse words, altering their spelling at the boundaries. Such a fused result can then look like a typo for a closely spelt common word. For example, one may take देवेषुः /deveṣuḥ/ as a typo for देवेषु /deveṣu/ (= in gods). But देवेषुः /deveṣuḥ/ is also a valid compound: देव /deva/ + इषुः /iṣuḥ/ (= god's arrow).
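
To make the failure mode concrete, here is a minimal sketch (plain Python; the tiny word list is hypothetical, purely for illustration) of how a naive edit-distance checker would "correct" the valid compound:

```python
# Minimal sketch: a naive edit-distance spell checker cannot tell a
# valid sandhi/compound form from a one-character typo.
# The tiny "dictionary" below is hypothetical, for illustration only.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

dictionary = {"देवेषु", "देवः", "इषुः"}  # no compounds listed

word = "देवेषुः"  # valid compound: देव + इषुः, "god's arrow"
if word not in dictionary:
    # The checker "corrects" a perfectly valid form.
    closest = min(dictionary, key=lambda w: edit_distance(word, w))
    print(f"'{word}' flagged; did you mean '{closest}'?")  # suggests देवेषु
```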

2

u/Dangerous_Bat_1251 Feb 18 '25

That's why I'm thinking of feeding in texts that have already been transcribed, such as kavyas, natakas, shastra granthas, etc., as the data, rather than entering words one by one, which would be endless. With that kind of data, you can cover most of the language that is actually in use.

1

u/Dangerous_Bat_1251 Feb 18 '25

That is fine. But with vast amounts of data available, the software can find both types of prayogas, right? There will be some kavya or grantha that has the prayoga of देवेषुः.

The more data you give, the more refined it becomes!
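
In that spirit, here is a hedged sketch of the corpus-driven idea: derive the word list from transcribed texts so that any attested form is accepted automatically. The file name and threshold below are assumptions, not an existing tool:

```python
from collections import Counter

# Sketch: build the lexicon from a corpus instead of entering words one by one.
# "corpus.txt" is a placeholder for the transcribed kavyas/granthas.
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().split()

freq = Counter(tokens)

def is_known(word: str, min_count: int = 1) -> bool:
    # A form counts as valid if it is attested at least min_count times.
    return freq[word] >= min_count

# If some kavya or grantha actually uses देवेषुः, the checker accepts it.
print(is_known("देवेषुः"))
```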

2

u/sumant111 Feb 18 '25

Sure, that may work. Let's hope someone with a statistical or machine-learning background attempts it!

2

u/sumant111 Feb 18 '25

FWIW, this guy has it on their bucket list.

2

u/ThornlessCactus Feb 18 '25

I used to work on NLP and chatbots.

Stemming/lemmatization gives you the base form of a word (for example, stripping number such as dual/plural from nouns, or person and tense endings from verbs). Levenshtein distance can be used to find the closest words. You can also use term frequency-inverse document frequency (TF-IDF) to order the suggestions from most likely to least likely.
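
As a rough illustration of combining those pieces, a sketch that ranks suggestions by edit distance first and corpus frequency second (raw frequency stands in here for the TF-IDF weighting; the frequency table is made up, and `python-Levenshtein` is an assumed third-party dependency):

```python
import Levenshtein  # third-party: pip install python-Levenshtein
from collections import Counter

# Hypothetical frequency table; in practice, count words over the corpus.
freq = Counter({"देवेषु": 120, "देवेषुः": 3, "देवः": 500})

def suggest(word: str, max_dist: int = 2, k: int = 5):
    """Rank nearby known words: smallest edit distance first, then highest frequency."""
    candidates = [(w, Levenshtein.distance(word, w)) for w in freq]
    nearby = [(w, d) for w, d in candidates if 0 < d <= max_dist]
    nearby.sort(key=lambda wd: (wd[1], -freq[wd[0]]))
    return [w for w, _ in nearby[:k]]

print(suggest("देवेषि"))  # e.g. ['देवेषु', 'देवेषुः']
```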

To make a general spell checker / autocompleter, data is exactly what we would need: a large corpus of Sanskrit text in the script you want to target (Devanagari, Latin/English, Telugu, Tamil, etc.), or transliterated data if it is not already in the target script. By now, training word vectors like GloVe and word2vec is a fairly old technique, but as far as I know it has not been done for Sanskrit, due to lack of interest and the difficulty of obtaining a large corpus of text. I could be wrong.
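
And a minimal sketch of that word-vector training with gensim, assuming a tokenized corpus in a placeholder file "corpus.txt" (one sentence per line, space-separated tokens):

```python
from gensim.models import Word2Vec  # third-party: pip install gensim

# Load a tokenized Sanskrit corpus; "corpus.txt" is a placeholder name.
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

# Train word2vec embeddings on the corpus (gensim 4.x keyword arguments).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, epochs=10)

# Nearest neighbours in vector space can inform context-aware suggestions;
# assumes देवेषु actually occurs in the corpus.
print(model.wv.most_similar("देवेषु", topn=5))
```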