The unique challenge India presents to natural language processing

Oct 3, 2018
Updated Oct 8, 2018 6:44 PM IST

As one of the world's fastest-growing economies and the country with the second-largest population, India is drawing considerable interest from internet and software companies. Much of that attention centers on utility applications that rely on language understanding, such as bots for call centers, customer service, search and virtual agents, across channels including voice, web and social.

Bain & Company's extensive report, 'Unlocking Digital for Bharat', estimates that India has 390 million internet users today, with one in five owning a smartphone. It also identifies solutions built around local needs and behavior as critical to improving user engagement. With online users skewed towards a young, male, urban demographic, large parts of the population remain untouched by online access. Natural Language Processing (NLP) has the potential to broaden online access to a wider share of India's population.

NLP technology has advanced significantly thanks to high-performance GPU machines, wider and faster internet access, and the spread of mobile devices. Is the time right for India to embrace this? New text-to-speech and speech-to-text services would significantly help low-income, visually challenged and differently-abled users become part of the Digital India revolution. As part of Google's Next Billion plan, voice search has already launched in eight Indian languages so that consumers can speak their search queries.

According to a recent survey on how chatbots are reshaping the online experience, the benefits consumers cited most were 24-hour service (64%), instant responses to inquiries (55%) and answers to simple questions (55%). But that's where things get complicated.

Language Ambiguity and Complexity

Even though English is one of our official languages, only 10 percent of Indians speak it. The remaining 90 percent speak languages such as Hindi, Marathi, Gujarati, Bengali, Kannada, Telugu and Tamil, to name just a few of the 29 major languages spoken in India.

NLP, a part of AI technology, is key to understanding and manipulating human language. Understanding a language means knowing its words, phrases, syntactic forms and concepts, and also knowing how to link those concepts together in a meaningful way. This requires extensive knowledge of the languages and the ability to interpret them. NLP already provides useful functions such as part-of-speech tagging, lemmatization, phrase extraction, text categorization, entity extraction, topic extraction and parsing.

Current methods for NLP are largely driven by computational statistics. These methods don't attempt to understand the text; instead they convert the text into data and then try to learn from patterns in that data. That makes it hard for machines to understand human language, which carries nuances and meanings that depend on context and on information that is never stated explicitly. Indic languages add further challenges. Most of them are written not in the Latin alphabet but in scripts of the Brahmic family. It's not the easiest language set for NLP to understand.
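To make "converting text into data" concrete, here is a minimal sketch in Python that uses only the standard library. It is not taken from any production system: the function names are illustrative assumptions, and reading the script from the first word of each character's Unicode name is a rough heuristic rather than a standard script-detection method.

```python
import unicodedata
from collections import Counter


def script_of(ch: str) -> str:
    """Rough script label: the first word of the character's Unicode name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def dominant_script(text: str) -> str:
    """Most common script among the letters of the text.

    Vowel signs and the virama are combining marks, not letters,
    so str.isalpha() skips them and only base letters are counted.
    """
    counts = Counter(script_of(ch) for ch in text if ch.isalpha())
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def bag_of_words(text: str) -> Counter:
    """Turn text into data: a count of each whitespace-separated token."""
    return Counter(text.split())


if __name__ == "__main__":
    print(dominant_script("नमस्ते दुनिया"))
    print(dominant_script("hello world"))
    print(bag_of_words("अच्छा अच्छा है"))
```

A statistical model sees only counts like these, which is why nuance and context are so easily lost.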

Sometimes a word is ambiguous: the same word may be pronounced differently by different people at different times, and it can carry different meanings depending on context, state of mind and geographical location. NLP algorithms must figure out these differences.

As an example:

"Accha!" means "Great!" or "Nice!", while the same word asked as a question, "Accha?", means "Really?"

"Kya baat hai!" means "Awesome!", while "Kya baat hai?" means "What is the matter?"

Resolving these kinds of ambiguities is a highly complex task, and it requires lexical resources and tools for developing disambiguation techniques.
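A toy sketch in Python shows why a lexical resource matters here. The tiny sense table below is a hypothetical stand-in built only from the two examples above, and using the final punctuation mark as the sole cue is a simplifying assumption. Real disambiguation would also need intonation, context and far larger lexicons.

```python
# Hypothetical sense lexicon: (romanized phrase, punctuation cue) -> meaning.
SENSES = {
    ("accha", "!"): "Great! / Nice!",
    ("accha", "?"): "Really?",
    ("kya baat hai", "!"): "Awesome!",
    ("kya baat hai", "?"): "What is the matter?",
}


def interpret(utterance: str) -> str:
    """Pick a sense using the trailing '!' or '?' as the only context cue."""
    text = utterance.strip()
    cue = text[-1] if text and text[-1] in "!?" else ""
    phrase = text.rstrip("!?").strip().lower()
    return SENSES.get((phrase, cue), "unknown")


if __name__ == "__main__":
    for sample in ["Accha!", "Accha?", "Kya baat hai!", "Kya baat hai?", "Accha"]:
        print(sample, "->", interpret(sample))
```

Without the punctuation cue, as in the last sample, the lookup cannot choose a sense at all, and spoken input often gives an algorithm nothing more than that.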

Lack of language grammar, literature and documented standards

One of the toughest challenges today is the lack of documented literature and grammar resources, even though millions of native speakers use these languages. Building NLP algorithms without a basic lexical resource is highly challenging. Rule-based methods exist, but they are language-specific and error-prone. The Ministry of Electronics and Information Technology has taken the lead on efforts to represent the 22 constitutionally recognized languages in the Unicode Standard.

Difficulty in obtaining data

Most NLP algorithms need sufficiently large collections of text that cover as many variations of meaning as possible. Discovering the word-based patterns in that text is what yields better NLP performance. In addition, documents such as legal contracts, news articles and research reports often use domain-specific discourse models, which also need to be incorporated into an NLP algorithm to enhance its performance.

Unfortunately, the data sets available for most Indian languages are small compared to those available for major Western languages.

Translation workarounds

With the advancement of deep learning, translation services are today much faster and more accurate than before. An alternative approach is therefore to translate the non-English text into English, pass it through an NLP engine built for English, collect the answer, and then translate it back into the original language. This works, but it is a cumbersome process and still struggles with idioms and colloquialisms.
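The pivot-through-English flow can be sketched in Python as three pluggable steps. Everything here is a hypothetical placeholder: the phrase tables, the `english_engine` rule and the function names stand in for real translation and NLP services and are not any vendor's API.

```python
from typing import Callable

# Toy phrase tables standing in for real translation services (hypothetical data).
HI_TO_EN = {"mera order kahan hai": "where is my order"}
EN_TO_HI = {"your order ships tomorrow": "aapka order kal bheja jayega"}


def to_english(text: str) -> str:
    """Step 1: translate into English; unknown input passes through unchanged."""
    return HI_TO_EN.get(text.lower().rstrip("?").strip(), text)


def english_engine(question: str) -> str:
    """Step 2: a stand-in for an NLP engine that only understands English."""
    if "order" in question:
        return "your order ships tomorrow"
    return "sorry, I did not understand"


def from_english(text: str) -> str:
    """Step 3: translate the English answer back for the user."""
    return EN_TO_HI.get(text, text)


def answer_via_english(
    query: str,
    translate_in: Callable[[str], str],
    engine: Callable[[str], str],
    translate_out: Callable[[str], str],
) -> str:
    """Chain the three steps: translate in, run the engine, translate out."""
    return translate_out(engine(translate_in(query)))


if __name__ == "__main__":
    print(answer_via_english("Mera order kahan hai?", to_english, english_engine, from_english))
    print(answer_via_english("Kya baat hai!", to_english, english_engine, from_english))
```

The second call shows the weakness: an idiom that the translation step cannot render reaches the English engine unchanged, and the failure carries through to the final answer.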

Making progress

But there is progress. C-DAC's Graphics and Intelligence-based Script Technology (GIST) lab and the Technology Development for Indian Languages (TDIL) programme have led initiatives to create language corpora, dictionaries and tools. IIT (Indian Institute of Technology) Bombay has set up a Center for Indian Language Technology (CFILT) with a grant from the Department of Information Technology (DIT) to facilitate NLP research and development, and has built Hindi, Marathi and Sanskrit WordNets.

NITI Aayog, in its #AIforAll program, is committed to leveraging AI for economic growth and social development. The program will support AI-based work in speech recognition and natural language processing, spanning research, development and the creation of a variety of new applications. In other words, we have some of our best brains working on solving India's NLP challenge.

We can certainly expect to see NLP-driven conversational bots in the future, reaching all corners of India and its diverse languages. It's just going to take a little longer than we might hope.

Purushottam Darshankar, Innovation and R&D Architect at Persistent Systems

