Bridging Linguistic Diversity: Evaluating and Advancing AI for Indian Languages
Introduction to Language Models and Their Benchmarks
Large language models (LLMs) are at the heart of modern AI, enabling machines to understand and generate human language. The effectiveness of these models is gauged through benchmarks: standardized tests designed to evaluate their performance across various tasks. Benchmarks play a crucial role in identifying strengths, pinpointing weaknesses, and guiding improvements in LLMs.
Key Aspects of Language Models:
- Scale: The ability to process vast amounts of data efficiently.
- Adaptability: Flexibility to perform a range of tasks from translation to summarization.
- Contextual Understanding: Comprehension of context and subtleties in language.
Benchmarks: What, Why, and How
What Are Benchmarks?
Benchmarks are standardized datasets and tasks used to assess the performance of language models. They provide a common ground for comparing different models.
Why Are They Important?
Benchmarks help in understanding how well models perform across different tasks, identifying areas for improvement, and driving the development of more capable AI systems.
How Are They Conducted?
Models are evaluated on predefined tasks using metrics such as accuracy, precision, and recall. These tasks range from sentiment analysis to natural language inference.
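As a rough illustration (not tied to any particular benchmark), scoring a model on a classification-style task usually reduces to comparing its predictions against reference labels. The sketch below uses scikit-learn's standard metric functions; the labels are made up for demonstration.

```python
# Minimal sketch of scoring a model's predictions on a benchmark task,
# assuming gold labels and model predictions have already been collected.
from sklearn.metrics import accuracy_score, precision_score, recall_score

gold = [1, 0, 1, 1, 0, 1]  # reference labels from the benchmark (illustrative)
pred = [1, 0, 0, 1, 0, 1]  # labels predicted by the model (illustrative)

print("accuracy:", accuracy_score(gold, pred))
print("precision:", precision_score(gold, pred))
print("recall:", recall_score(gold, pred))
```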
Key Benchmarks
GLUE (General Language Understanding Evaluation):
- Purpose: Evaluates general language understanding tasks.
- Tasks: Includes sentiment analysis, sentence similarity, and natural language inference.
- Advantages: Comprehensive evaluation of model capabilities.
- Limitations: Primarily focused on English, which limits its applicability to other languages.
SuperGLUE:
- Purpose: Designed to challenge more advanced models beyond what GLUE offers.
- Tasks: Includes Boolean QA, commonsense reasoning, and coreference resolution.
- Advantages: Introduces more complex tasks requiring deeper understanding.
- Limitations: Resource-intensive and still centered on English.
HellaSwag:
- Purpose: Tests commonsense reasoning by predicting plausible continuations of events.
- Data Source: Derived from ActivityNet Captions and WikiHow.
- Advantages: Focuses on practical scenarios and everyday reasoning.
- Limitations: Primarily in English, specific to certain types of reasoning.
MMLU (Massive Multitask Language Understanding):
- Purpose: Evaluates the performance of models across a broad spectrum of subjects.
- Tasks: Includes questions from standardized tests and professional exams.
- Advantages: Broad coverage of subjects and real-world relevance.
- Limitations: Performance can vary significantly with small changes in test conditions, such as the order of answers or symbols.
Developing and Evaluating LLMs for Indian Languages
The Journey of IndicLLMs:
The journey of IndicLLMs began with IndicBERT in 2020, focusing on Natural Language Understanding (NLU). IndicBERT has over 400K downloads on Hugging Face, highlighting its widespread use. IndicBART followed in 2021, targeting Natural Language Generation (NLG). These models were developed with support from EkStep Foundation and Nilekani Philanthropies, despite the limited data and model scale available.
With the introduction of large open models like Llama-2 and Mistral, the focus shifted towards adapting these models for Indic languages. Initiatives like OpenHathi (a base model) and Airavata (a chat model) have emerged, tailoring these open models to specific Indian languages. The adaptation typically involves extending the tokenizer and embedding layer, followed by continual pretraining on data drawn from existing multilingual corpora such as mC4, OSCAR, and ROOTS. A hedged sketch of this recipe follows.
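To make the adaptation recipe concrete, here is a minimal sketch using the Hugging Face transformers API. The base model ID, the added tokens, and the data sources mentioned in the comments are illustrative assumptions, not the exact setup used by OpenHathi or Airavata.

```python
# Hedged sketch of the adaptation recipe: extend the tokenizer with Indic
# tokens, resize the embedding layer, then continue pretraining on Indic text.
from transformers import AutoTokenizer, AutoModelForCausalLM

base_id = "meta-llama/Llama-2-7b-hf"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# New Devanagari (or other Indic-script) tokens, in practice learned from an
# Indic corpus; the ones below are placeholders.
new_tokens = ["नमस्ते", "भारत", "भाषा"]
tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token IDs have rows to train.
model.resize_token_embeddings(len(tokenizer))

# Continual pretraining on Indic text (e.g. subsets of mC4 / OSCAR / ROOTS)
# would then follow with a standard causal-LM training loop or the HF Trainer.
```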
Challenges in Indic Language Models:
- Data Scarcity: Limited high-quality datasets for many Indian languages.
- Dialectal Variations: Managing diverse dialects and regional nuances.
- Technological Gaps: Need for more computational resources and standardized tools for development and evaluation.
Why Indic-Only Models Are Necessary:
Despite the capabilities of models like GPT-3.5 and GPT-4, there are specific reasons why Indic-only models are essential:
- Tokenization Efficiency: Indic-script text is poorly represented by English-centric tokenizers, so the same content is split into far more tokens, inflating cost and context usage (see the short sketch after this list).
- Performance on Low-Resource Languages: English models perform well with high-resource languages but struggle with low-to-medium resource languages like Oriya, Kashmiri, and Dogri.
- Accuracy and Hallucinations: Issues like hallucinations are more pronounced in Indic languages, significantly decreasing the accuracy of responses.
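As a quick illustration of the tokenization point above, the snippet below counts how many tokens an English-centric tokenizer produces for a Hindi sentence versus a short English one. The exact numbers depend on the tokenizer, so treat this as a sketch rather than a measurement.

```python
# Compare tokenizer "fertility" for English vs. Hindi text using an
# English-centric tokenizer (GPT-2's BPE vocabulary).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

english = "How is the weather today?"
hindi = "आज मौसम कैसा है?"

print(len(tok.tokenize(english)))  # relatively few tokens
print(len(tok.tokenize(hindi)))    # typically many more tokens / byte fallbacks
```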
Samanantar Dataset
Overview:
Samanantar is a large-scale parallel corpus collection designed to support machine translation and other NLP tasks. It contains 49.7 million sentence pairs between English and 11 Indic languages, representing two language families.
Data Collection:
The data for Samanantar was collected from various sources, including news articles, religious texts, and government documents. The process involved identifying parallel sentences, scoring their similarity, and post-processing to ensure quality.
Creation Process:
- Parallel Sentences: Identifying sentences that are translations of each other.
- Scoring Function: Using LaBSE embeddings to score how likely two sentences are to be translations of each other (a small scoring sketch follows this list).
- Post-Processing: Removing duplicates and ensuring high-quality sentence pairs.
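Here is a minimal sketch of the scoring step, using the publicly available LaBSE model from the sentence-transformers library. The similarity threshold shown is illustrative and not the value used to build Samanantar.

```python
# Score a candidate English-Hindi sentence pair with LaBSE embeddings and
# keep it only if the cosine similarity clears an (illustrative) threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

english = ["India is a diverse country."]
hindi = ["भारत एक विविधतापूर्ण देश है।"]

emb_en = model.encode(english, convert_to_tensor=True)
emb_hi = model.encode(hindi, convert_to_tensor=True)

score = util.cos_sim(emb_en, emb_hi).item()
if score > 0.8:  # illustrative threshold, not the Samanantar setting
    print("keep pair, similarity =", round(score, 3))
```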
Challenges in Data Collection:
The inherent noisiness of web-sourced data is a significant challenge. The quality of content varies, often containing unwanted content like poorly translated text. Ensuring high-quality, relevant content is crucial, which is why human verification plays a vital role in the data collection pipeline.
Sangraha Corpus: The Foundation for Indian LLMs
Components:
- Sangraha Verified: Contains 48B tokens of high-quality, human-verified web crawled content in all 22 scheduled Indian languages.
- Sangraha Synthetic: Includes 90B tokens from translations of English Wikimedia into 14 Indian languages and 72B tokens from transliterations into Roman script.
- Sangraha Unverified: Adds 24B tokens of high-quality, unverified data from existing collections like CulturaX and MADLAD-400.
IndicGLUE
Overview:
IndicGLUE focuses on core NLU tasks such as sentiment analysis, NER, and QA. It covers 11 Indian languages and relies on machine translation for some of its datasets. However, it is not explicitly designed for zero-shot evaluation, which limits its applicability.
Key Tasks:
- News Category Classification: Classifying news articles into predefined categories.
- Named Entity Recognition (NER): Identifying and classifying proper nouns and entities.
- Headline Prediction: Generating headlines for given texts.
- Question Answering (QA): Answering questions based on given text passages.
IndicXTREME
Overview:
IndicXTREME is a human-supervised benchmark designed to evaluate models on nine diverse NLU tasks across 20 Indian languages. It includes 105 evaluation sets, with 52 newly created for this benchmark, ensuring high quality and relevance.
Key Features:
- Largest Monolingual Corpora: IndicCorp with 20.9B tokens across 24 languages.
- Human-Supervised Benchmark: Emphasizes human-created or human-translated datasets.
- Tasks: Covers 9 diverse NLU tasks, including classification, structure prediction, QA, and text retrieval.
- Zero-Shot Testing: Designed to test the zero-shot multilingual capabilities of pretrained language models.
Advantages Over IndicGLUE:
- Broader Coverage: Evaluates more languages and tasks.
- Higher Quality: Human supervision ensures better accuracy.
- Zero-Shot Capabilities: Tests whether a model generalizes to a language without task-specific training data in that language (a minimal evaluation sketch follows).
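In the zero-shot setting, a model fine-tuned only on English task data is evaluated directly on Indic-language test sets. The sketch below assumes a hypothetical fine-tuned classifier on the Hugging Face Hub; the model ID and its label names are placeholders, not an official IndicXTREME baseline.

```python
# Zero-shot cross-lingual evaluation sketch: a classifier fine-tuned on English
# sentiment data is tested on Hindi examples it never saw during fine-tuning.
from transformers import pipeline

# Placeholder model ID; assumes the model emits "positive"/"negative" labels.
clf = pipeline("text-classification", model="my-org/mbert-finetuned-english-sentiment")

hindi_test = [
    ("यह फिल्म शानदार थी।", "positive"),
    ("सेवा बहुत खराब थी।", "negative"),
]

correct = sum(clf(text)[0]["label"].lower() == gold for text, gold in hindi_test)
print("zero-shot accuracy:", correct / len(hindi_test))
```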
OpenHathi and Airavata LLM Models
OpenHathi:
- Developed By: Sarvam AI and AI4Bharat.
- Base Model: Extended from Llama 2.
- Focus: Foundational model for Hindi.
- Key Features: Trained on diverse Hindi datasets, open source for community use.
Airavata:
- Developed By: AI4Bharat and Sarvam AI.
- Base Model: Fine-tuned from OpenHathi.
- Focus: Instruction-tuned model for assistive tasks in Hindi.
- Key Features: Instruction-tuned on the IndicInstruct dataset, with 7B parameters, optimized for following instructions and performing assistive tasks in Hindi.
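Since Airavata is released openly, a minimal generation sketch might look like the following; the Hugging Face model ID and the plain-text prompt format are assumptions that should be checked against the official model card.

```python
# Minimal sketch of loading Airavata for Hindi instruction following.
# The model ID and prompt format are assumptions; see the model card for
# the recommended chat/instruction template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai4bharat/Airavata"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "भारत के तीन प्रमुख त्योहारों के नाम बताइए।"  # "Name three major festivals of India."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```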
Issues with Machine Translations for Indian Languages
Machine translations play a crucial role in building datasets for Indic language models, but they come with significant challenges and limitations:
Context Loss:
- Issue: Machine translations often lose the nuanced meanings and context of the original text.
- Example: Idiomatic expressions or cultural references can be inaccurately translated, leading to a loss of intended meaning.
- Impact: This affects the comprehension and relevance of the translated text, which can mislead the language model during training.
Partial Sentences:
- Issue: Translating partial sentences or phrases can lead to ambiguities and incorrect interpretations.
- Example: A phrase in English might not have a direct counterpart in an Indic language, leading to incomplete or inaccurate translations.
- Impact: This can result in fragmented or nonsensical data that negatively impacts the model's learning process.
Order and Format Changes:
- Issue: Changes in the order of words or the format of sentences during translation can significantly alter the meaning.
- Example: The structure of questions and answers can be altered, leading to inconsistencies in the data.
- Impact: This inconsistency can cause models to perform poorly, as they struggle to interpret the translated text correctly.
Bias Introduction:
- Issue: Automated translation processes can introduce or amplify biases present in the source or target languages.
- Example: Gender biases or cultural biases might be exaggerated or incorrectly represented in translations.
- Impact: These biases can skew the training data, leading to biased language models that do not perform equitably across different user groups.
Cultural Nuances:
- Issue: Capturing the cultural context and nuances specific to Indic languages is challenging for machine translations.
- Example: Cultural references, local customs, and regional dialects might not be accurately translated.
- Impact: This can lead to misunderstandings and misinterpretations, reducing the effectiveness and relevance of the language model.
Resource Intensity:
- Issue: Ensuring high-quality translations requires significant computational and human resources.
- Example: Manual verification and correction of machine-translated data can be resource-intensive.
- Impact: The high cost and effort involved can limit the scalability and feasibility of creating large, high-quality datasets.
Addressing These Challenges
To overcome these challenges, several strategies can be employed:
Collaborative Translation Approach:
- Combining machine translation with human validation to ensure accuracy and cultural relevance.
- Involving native speakers and linguists in the translation process to maintain context and nuance.
Standardized Guidelines:
- Developing clear guidelines for translators to maintain consistency and quality across translations.
- Training translators to understand the specific requirements and nuances of NLP tasks.
Contextual Embedding Techniques:
- Using advanced embedding techniques to preserve the context and meaning of sentences during translation.
- Implementing thresholding and post-processing steps to filter out low-quality translations.
Multilingual Prompting Strategies:
- Designing prompts that are suitable for multiple languages and contexts to improve model performance.
- Utilizing few-shot learning techniques to provide models with contextually relevant examples.
Bias Mitigation:
- Conducting regular bias audits on translated datasets to identify and address potential biases.
- Ensuring datasets include diverse sources and contexts to reduce the impact of any single bias.
Resource Optimization:
- Using efficient translation tools and APIs to handle large-scale translations without compromising quality.
- Optimizing computational resources to manage the high demands of translation processes.
By implementing these strategies, we can create more accurate, culturally relevant, and effective language models for Indian languages, ensuring they are robust and equitable for all users.
Pariksha Benchmark
Challenges with Existing Multilingual Benchmarks:
Cross-Lingual Contamination:
- Even if the multilingual version of a benchmark is not contaminated, the original English version might be. Models can use knowledge of the English benchmark through cross-lingual transfer, making the multilingual benchmark indirectly contaminated.
Loss of Cultural and Linguistic Nuances:
- Direct translations of benchmarks created in English and in a Western context often lose crucial cultural and linguistic nuances. Specialized models need to be evaluated on these dimensions to ensure relevance and accuracy.
Unsuitability of Standard Metrics:
- Standard metrics used in many benchmarks, such as exact match and word overlap, are not suitable for Indian languages due to non-standard spellings. This can unfairly penalize a model for using slightly different spellings than those in the benchmark reference data.
Methodology:
Step-by-Step Process:
Curating Evaluation Prompts:
- A diverse set of evaluation prompts is curated with the help of native speakers to ensure cultural and linguistic relevance.
Generating Model Responses:
- Responses to the curated prompts are generated from the models under consideration, capturing a wide range of linguistic behaviors and outputs.
Evaluation Settings:
- The generated responses are evaluated in two settings:
- Individual Evaluation: Each response is evaluated on its own.
- Pairwise Comparison: Responses are compared against each other to determine which one is better.
Constructing Leaderboards:
- Scores from the evaluations are used to construct leaderboards, providing a clear ranking of model performance.
Introduction to ELO Rating System:
The ELO rating system, widely used in competitive games like chess, measures the relative skill levels of players. In the Pariksha Benchmark, we adapt the ELO rating system to evaluate and compare the performance of AI models based on their responses to evaluation prompts. This system allows us to convert human preferences into ELO ratings, predicting the win rates between different models.
Formulas and Explanation:
1. Expected Score (E_A): E_A = 1 / (1 + 10^((R_B - R_A) / 400))
- Explanation: This formula calculates the expected score for model A when compared to model B, where R_A and R_B are the current ratings of models A and B, respectively. The expected score represents the probability of model A winning against model B.
2. Rating Update Formula: R_A' = R_A + K * (S_A - E_A)
- Explanation: This formula updates the rating of model A after a comparison. R_A is the current rating, R_A' is the new rating, K is a factor that determines the sensitivity of the rating system, S_A is the actual score (1 for a win, 0.5 for a draw, 0 for a loss), and E_A is the expected score from the first formula. The rating is adjusted by the difference between the actual and expected outcomes, scaled by K.
3. Bradley-Terry Model: P(i beats j) = p_i / (p_i + p_j)
- Explanation: The Bradley-Terry model, fit by maximizing the log-likelihood of the observed comparisons, assigns each model i a performance parameter p_i. It assumes a fixed but unknown pairwise win rate and estimates the probability that model i will outperform model j, which underlies the ELO ratings.
ELO Calculation Process:
Step-by-Step Process:
Pairwise Comparison:
- For each prompt, responses from two models are compared.
- Human evaluators or an LLM decide which response is better.
Expected Score Calculation:
- The expected score (E_A) is calculated for model A against model B using the first formula.
- This gives the probability of model A winning against model B.
Rating Update:
- After the comparison, the actual score (S_A) is determined (1 for a win, 0.5 for a draw, 0 for a loss).
- The new rating (R_A') is calculated using the second formula, updating model A’s rating based on its performance relative to the expectation.
Bradley-Terry Model Application:
- The Bradley-Terry model is used to estimate the fixed pairwise win-rate, ensuring that the order of comparisons does not affect the ratings.
- The probability of one model outperforming another is calculated to provide a robust comparison framework.
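A small numeric sketch of one such update, following the formulas above (the K value is illustrative):

```python
# One ELO update after a pairwise comparison between models A and B.
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B given current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, s_a: float, k: float = 32.0) -> float:
    """s_a is 1 for a win by A, 0.5 for a draw, 0 for a loss."""
    return r_a + k * (s_a - expected_score(r_a, r_b))

r_a, r_b = 1000.0, 1000.0
r_a_new = update(r_a, r_b, s_a=1.0)  # model A's response was preferred
print(round(r_a_new, 1))             # 1016.0 with K = 32
```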
Individual Metrics:
Linguistic Acceptability:
- Measures if the text is in the correct language and grammatically correct. It is rated on a scale of 0-2.
Task Quality:
- Assesses if the answer is of high quality and provides useful information. It is also rated on a scale of 0-2.
Hallucination:
- Checks whether the answer contains untrue or made-up facts. It is rated on a binary scale (0 or 1).
Inter-Rater Reliability Metrics:
Percentage Agreement (PA):
- Calculates the percentage of items on which annotators agree, ranging from 0 (no agreement) to 1 (perfect agreement).
Fleiss Kappa (κ):
- Measures inter-annotator agreement, accounting for the agreement occurring by chance.
Kendall’s Tau:
- A correlation coefficient that measures the relationship between two columns of ranked data, used to compare leaderboards obtained through different evaluation techniques (a small computation sketch for these metrics follows).
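The sketch below computes all three reliability metrics, assuming three annotators rated the same items on a 0-2 scale and two leaderboards are being compared; the ratings and rankings are made-up examples.

```python
# Percentage agreement, Fleiss' kappa, and Kendall's tau on toy data.
import numpy as np
from scipy.stats import kendalltau
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = items, columns = annotators (illustrative ratings on a 0-2 scale)
ratings = np.array([[2, 2, 2], [1, 1, 0], [0, 0, 0], [2, 1, 2]])

# Percentage agreement: fraction of items where all annotators agree.
pa = np.mean([len(set(row)) == 1 for row in ratings])

# Fleiss' kappa: chance-corrected agreement over the same ratings.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table)

# Kendall's tau between two leaderboard rankings (e.g. human vs. LLM judges).
human_rank = [1, 2, 3, 4, 5]
llm_rank = [1, 3, 2, 4, 5]
tau, _ = kendalltau(human_rank, llm_rank)

print(round(pa, 2), round(kappa, 2), round(tau, 2))
```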
Higher agreement scores among human annotators compared to human-LLM pairs suggest that while GPT-4 performs well, human evaluators still provide more reliable and consistent evaluations. The variation across languages could point to specific challenges in those languages, such as syntax complexity or cultural nuances that GPT-4 might not fully grasp.
Way Forward: Developing Truly "Indian" Language Models
Vision:
The goal is to develop models that go beyond multilingual capabilities to truly understand and generate culturally and contextually relevant content for all Indian users. This involves creating models that act as digital knowledge companions, comprehending cultural idioms, historical references, regional specifics, and diverse interaction styles.
Key Strategies:
- High-Quality Data Curation: Ensuring datasets are comprehensive, diverse, and of high quality.
- Human Supervision: Leveraging language experts for data annotation and translation.
- Broad Evaluation: Developing benchmarks like IndicXTREME to evaluate a wide range of tasks across multiple languages.
- Continual Adaptation: Updating and refining models to keep pace with linguistic and cultural changes.
Conclusion
The development and evaluation of Indic language models are crucial for advancing AI in India. By focusing on comprehensive data curation, human supervision, and robust evaluation frameworks, we can create models that are not only multilingual but truly multicultural. Initiatives like IndicXTREME, OpenHathi, Airavata, and IndicMT Eval are paving the way for a future where AI can seamlessly interact with and understand the diverse linguistic landscape of India. As we continue to innovate and refine these models, we move closer to achieving truly inclusive and effective AI solutions for all Indian languages.