Mitodru Niyogi

Bengaluru, Karnataka, India
1K followers · 500+ connections

About

Hire me if you want to train LLMs from scratch or fine-tune existing ones!…

Experience & Education

  • Stealth Startup

Licenses & Certifications

Publications

  • PARAMANU-AYN: An Efficient Novel Generative and Instruction-tuned Language Model for Indian Legal Case Documents

    arXiv

    In this paper, we present PARAMANU-AYN, a language model based exclusively on case documents of the Supreme Court of India, the Constitution of India, and the Indian Penal Code. The novel autoregressive (AR) decoder-based model is pretrained from scratch with a context size of 8192. We evaluated our pretrained legal model on perplexity. We also instruction-tuned our pretrained model on a set of 10,763 instructions covering various legal tasks such as legal reasoning, judgement explanation, legal clause generation, legal drafting, legal contract drafting, case summarization, and constitutional question answering. We also had GPT-3.5-Turbo evaluate the instruction-tuned model's responses on clarity, relevance, completeness, and legal reasoning, each on a scale of 10. Our model can run on a CPU and achieved a CPU inference speed of 42.46 tokens/sec. We found that our models, despite not being pretrained on legal books, various legal contracts, and legal documents, were able to learn the domain knowledge required for drafting various legal contracts and legal clauses, and generalized to drafting legal contracts and clauses with limited instruction tuning. Hence, we conclude that very large amounts of data are not required to develop a strong domain-specialized generative language model (such as a legal one) from scratch. We believe this work is the first attempt to build a dedicated generative legal language model from scratch for the Indian Supreme Court jurisdiction, and in legal NLP overall. We plan to release our Paramanu-Ayn model at https://www.bharatgpts.com.

    Other authors
    See publication
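
    The abstract above reports perplexity evaluation of the pretrained legal model. As a rough illustration of that metric (not the paper's actual evaluation code), here is a minimal sketch of perplexity scoring for a causal LM using the Hugging Face transformers API; the model name is a placeholder, not an actual Paramanu-Ayn checkpoint.

    ```python
    # Sketch: perplexity of a causal LM over one text, i.e. exp(mean next-token NLL).
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "my-org/legal-lm"  # placeholder, not the actual Paramanu-Ayn model

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.eval()

    def perplexity(text: str, max_len: int = 8192) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_len]
        with torch.no_grad():
            # With labels=input_ids, the model shifts labels internally and
            # returns the mean cross-entropy (negative log-likelihood).
            loss = model(ids, labels=ids).loss
        return math.exp(loss.item())

    print(perplexity("The Supreme Court of India held that ..."))
    ```
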
  • Paramanu: A Family of Novel Efficient Indic Generative Foundation Language Models

    arXiv

    We present Gyan AI Paramanu ("atom"), a family of novel language models for Indian languages. It is a collection of auto-regressive monolingual, bilingual, and multilingual Indic language models pretrained from scratch on a single GPU for 10 Indian languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu), with sizes ranging from 13.29M to 367.5M parameters. The models are pretrained with a context size of 1024 on a single GPU. The models are very efficient, small, fast, and powerful. We also developed an efficient, advanced Indic tokenizer that can even tokenize unseen languages. To avoid the "curse of multilinguality" in our multilingual mParamanu model, we pretrained on comparable corpora grouped typologically by shared script. We performed human evaluation of our pretrained models for open-ended text generation on grammar, coherence, creativity, and factuality for Bangla, Hindi, and Sanskrit. Our Bangla, Hindi, and Sanskrit models outperformed the GPT-3.5-Turbo (ChatGPT), Bloom 7B, LLaMa-2 7B, OPT 6.7B, GPT-J 6B, GPTNeo 1.3B, and GPT2-XL large language models (LLMs) by a large margin despite being 20 to 66 times smaller than standard 7B LLMs. A CPU is sufficient to run inference on our pretrained models; no GPU is needed. We also instruction-tuned our pretrained Bangla, Hindi, Marathi, Tamil, and Telugu models on 23k instructions in the respective languages. Our pretrained and instruction-tuned models, which are the first of their kind and the most capable efficient small generative language models yet developed for Indic languages, and the various results, lead to the conclusion that high-quality generative language models are possible without large amounts of compute power and enormous numbers of parameters.

    Other authors
    See publication
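
    The abstract above mentions an Indic tokenizer that can tokenize even unseen languages. The paper's actual tokenizer design is not reproduced here; one common way to obtain that robustness is byte fallback, sketched below with SentencePiece (the corpus path, vocabulary size, and other hyperparameters are assumptions).

    ```python
    # Sketch: a BPE tokenizer with byte fallback, one common way to make a
    # tokenizer robust to unseen scripts; not the paper's actual tokenizer.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="indic_corpus.txt",    # placeholder: training text, one sentence per line
        model_prefix="indic_bpe",
        vocab_size=32000,            # assumed size, not taken from the paper
        model_type="bpe",
        character_coverage=0.9995,   # high coverage suits multi-script corpora
        byte_fallback=True,          # unknown characters decompose into raw bytes
    )

    sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")
    print(sp.encode("নমস্কার, কেমন আছেন?", out_type=str))  # Bangla sample
    ```
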
  • Neural Models for Source Code Synthesis and Completion

    Heidelberg University · arXiv

    Natural language (NL) to code suggestion systems assist developers in Integrated Development Environments (IDEs) by translating NL utterances into compilable code snippets. The current approaches mainly involve hard-coded, rule-based systems based on semantic parsing. These systems make heavy use of hand-crafted rules that map patterns in NL, or elements in its syntax parse tree, to various query constructs, and can only work on a limited subset of NL with a restricted syntax. These systems are unable to extract semantic information from the coding intents of the developer, and often fail to infer the types, names, and context of the source code needed for accurate system-level code suggestions. In this master's thesis, we present sequence-to-sequence deep learning models and training paradigms to map NL to general-purpose programming languages, which can suggest source code snippets given an NL intent and also extend source-code auto-completion to users as they write code. The developed architecture incorporates contextual awareness into neural models that generate source code tokens directly, instead of generating parse trees or abstract meaning representations from the source code and converting them back to source code. The proposed pretraining strategy and data augmentation techniques improve the performance of the proposed architecture. The proposed architecture was found to exceed the performance of the neural semantic parser TranX on the BLEU-4 metric by 10.82%. Thereafter, a finer analysis of the parsable code translations from NL intents for the CoNaLa challenge was introduced. The proposed system is bidirectional, as it can also be used to generate NL code documentation given source code. Lastly, a RoBERTa masked language model for Python was proposed to extend the developed system to code completion.

    See publication
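
    The thesis above reports beating the neural semantic parser TranX by 10.82% on BLEU-4. As a sketch of how such a corpus-level score is computed (not the thesis's exact evaluation pipeline), here is BLEU via sacreBLEU with made-up placeholder snippets.

    ```python
    # Sketch: corpus-level BLEU (4-gram by default) over generated code snippets.
    import sacrebleu

    hypotheses = [                     # placeholder model outputs
        "df = pd.read_csv('data.csv')",
        "squares = [i ** 2 for i in range(10)]",
    ]
    references = [[                    # one reference stream, parallel to hypotheses
        "df = pd.read_csv('data.csv')",
        "squares = [i * i for i in range(10)]",
    ]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU-4: {bleu.score:.2f}")
    ```
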
  • Learning Multilingual Embeddings for Cross-Lingual Information Retrieval in the Presence of Topically Aligned Corpora

    CoRR (arXiv)

    Cross-lingual information retrieval is a challenging task in the absence of aligned parallel corpora. In this paper, we address this problem by considering topically aligned corpora designed for evaluating an IR setup. To emphasize, we use neither sentence-aligned nor document-aligned corpora, nor any language-specific resources such as dictionaries, thesauri, or grammar rules. Instead, we embed both languages into a common space and learn word correspondences directly from it. We test our proposed approach for bilingual IR on standard FIRE datasets for Bangla, Hindi, and English. The proposed method is superior to the state-of-the-art not only on IR evaluation measures but also in terms of time requirements. We extend our method successfully to the trilingual setting.

    Other authors
    See publication
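
    The paper above retrieves across languages by embedding queries and documents into a common space. How that space is learned is not shown here; the sketch below covers only the retrieval step, ranking documents by cosine similarity, with random vectors standing in for the learned embeddings.

    ```python
    # Sketch: the retrieval step of embedding-based CLIR, ranking documents
    # by cosine similarity in a shared space. Vectors are random stand-ins.
    import numpy as np

    rng = np.random.default_rng(0)
    doc_vecs = rng.normal(size=(1000, 300))   # e.g., English document embeddings
    query_vec = rng.normal(size=300)          # e.g., a Bangla or Hindi query embedding

    def cosine_rank(query, docs, k=5):
        """Indices and scores of the top-k documents by cosine similarity."""
        q = query / np.linalg.norm(query)
        d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
        scores = d @ q
        top = np.argsort(-scores)[:k]
        return top, scores[top]

    idx, scores = cosine_rank(query_vec, doc_vecs)
    print(list(zip(idx.tolist(), np.round(scores, 3).tolist())))
    ```
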
  • Discovering conversational topics and emotions from Demonetization tweets in India

    Springer

    In Proceedings of the 2017 International Conference on Computational Intelligence: Theories, Applications and Future Directions (ICCI 2017), IIT Kanpur, India

    Other authors
    • Asim K. Pal
    See publication
  • IR-IITBHU at TREC 2016 Open Search Track: Retrieving documents using Divergence From Randomness model in Terrier

    NIST Special Publication: SP 500-321

    The Twenty-Fifth Text REtrieval Conference Proceedings (TREC 2016), Gaithersburg, Maryland.

    Other authors
    • Sukomal Pal
    See publication
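
    The TREC run above used a Divergence From Randomness weighting model in the Terrier platform. A minimal present-day equivalent through PyTerrier, Terrier's Python bindings, is sketched below; the index path and the specific DFR variant (In_expB2) are assumptions, since the 2016 run used the Java platform directly.

    ```python
    # Sketch: ranking with a Divergence From Randomness weighting model via
    # PyTerrier. Index path and DFR variant are assumptions, not the 2016 setup.
    import pyterrier as pt

    if not pt.started():
        pt.init()

    index = pt.IndexFactory.of("./terrier_index")     # placeholder: a prebuilt index
    dfr = pt.BatchRetrieve(index, wmodel="In_expB2")  # one classic DFR model
    print(dfr.search("open search track").head())
    ```
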

Courses

  • C Programming & Fundamentals
  • Computer Graphics
  • Computer Networks
  • Cryptography & Network Security
  • Cyber Law & Security
  • DBMS
  • Data Mining & Data Warehousing
  • Data Structures & Algorithms
  • Design & Analysis of Algorithms
  • Distributed Computing Systems
  • E-commerce
  • Engineering Economics
  • Formal Languages & Automata Theory
  • Image Processing
  • Industrial Management
  • Information Theory & Coding
  • Internetworking
  • Multimedia Systems
  • Object-Oriented Programming
  • Operating Systems
  • Operations Research
  • Web Technology

Honors & Awards

  • 4th All Bengal Mathematics Talent Search Exam

    JMMC Research Foundation

    Ranked 3rd in All Bengal Mathematics Talent Search Exam 2011 conducted by JMMC

  • All India Camel Colour Contest (State Level)

    Camlin Limited

    Awarded the All India Camel Colour Contest prize at the state level three times in a row

  • Best All-rounder

    Maria's Day School

    Awarded in Class 10 for interdisciplinary achievements in academics, co-curricular activities, and sports.

  • Best Speaker

    Council for the Indian School Certificate Examinations

    Best Speaker at secondary school in Class X and runner-up in the Frank Anthony Memorial Debate (City Round)

  • Bivakar (5th Year) Distinction in Painting

    Bangiya Sangeet Parishad, Rabindra Bharati University, Kolkata

    Qualified the fifth year in painting with triple distinction, awarded by Bangiya Sangeet Parishad

  • Limca Quiz City Round, Bengal ALSOC, & TTIS School Quizzes

    Ranked 3rd in the Limca Quiz City Round and placed in the top 8 in the Bengal ALSOC and TTIS school quizzes

Languages

  • English

    Native or bilingual proficiency

  • Bengali

    Native or bilingual proficiency

  • Hindi

    Professional working proficiency

  • German

    Limited working proficiency
