Congrats to Bo-Han Lu on the fantastic presentation of our work at LREC-COLING 2024! A great collaboration with Richard Tzong-Han Tsai. We built a dual translation model between Hokkien and Mandarin/English on LLaMA 2-7B, standardized Hokkien's writing systems into Hokkien Han, and evaluated translation quality with GPT-4.

Enhancing Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems
https://lnkd.in/gjWPw7fT
Bo-Han Lu, Yi-Hsuan Lin, En-Shiun Annie Lee, Richard Tzong-Han Tsai

Machine translation focuses mainly on high-resource languages (HRLs), while low-resource languages (LRLs) like Hokkien remain relatively under-explored. This study addresses that gap by developing a dual translation model between Hokkien and both Traditional Mandarin Chinese and English. We employ a pre-trained LLaMA 2-7B model specialized in Traditional Mandarin Chinese to leverage the orthographic similarities between Hokkien Han and Traditional Mandarin Chinese. Our experiments cover translation tasks across the various writing systems of Hokkien as well as between Hokkien and other HRLs. We find that even a limited monolingual corpus further improves the model's Hokkien capabilities. We then use our translation model to standardize all Hokkien writing systems into Hokkien Han, yielding further performance gains. Additionally, we introduce an evaluation method that combines back-translation with GPT-4 to ensure reliable translation quality assessment even for LRLs. The study helps narrow the resource gap for Hokkien and empirically investigates the advantages and limitations of pre-training and fine-tuning based on LLaMA 2.
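For readers curious how a back-translation plus GPT-4 evaluation loop might look in code, here is a minimal sketch assuming an OpenAI-style chat client. The prompts, the 1-5 adequacy scale, and the function name back_translate_and_score are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of LRL evaluation via back-translation + GPT-4 scoring.
# Prompts, rating scale, and model name are illustrative assumptions,
# not the exact setup described in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def back_translate_and_score(source_zh: str, hokkien_hyp: str) -> int:
    """Back-translate a Hokkien hypothesis to Mandarin, then ask GPT-4
    how well the back-translation preserves the source meaning (1-5)."""
    back_translation = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Translate this Taiwanese Hokkien (Han script) sentence "
                       "into Traditional Mandarin Chinese:\n" + hokkien_hyp,
        }],
    ).choices[0].message.content

    judgement = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "On a scale of 1-5, how well does sentence B preserve "
                       "the meaning of sentence A? Answer with a single digit.\n"
                       "A: " + source_zh + "\nB: " + back_translation,
        }],
    ).choices[0].message.content
    return int(judgement.strip()[0])
```

Averaging such scores over a held-out set would give a rough adequacy estimate even when no reference translations exist for the LRL side.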
More Relevant Posts
Congratulations to E. David Guzmán Ramírez on being featured by the Master of Science in Applied Computing (MScAC) program. It was great to experience NAACL and the richness of South American languages with you and @eric khiu in Mexico City.
We are thrilled to celebrate the upcoming graduation of our Master of Science in Applied Computing (MScAC) students next week! Our talented graduates have been on an incredible journey, making a significant impact on the tech innovation space and on society. We had the privilege of speaking with six outstanding MScAC grads, who shared their unique program experiences, passions, and the motivations that drove them to excel in the MScAC. Featured MScAC Grad Spotlights: Keerat Kaur Guliani, Sarah Hindawi, Adeem Jassani, Horus L., E. David Guzmán Ramírez, Jason Tang. Read more: https://bit.ly/mscacgrad24 #UofTGrad24 #MScAC #UofTCompSci
UPDATED: NEW TIME TOMORROW for the Birds of a Feather event on Low Resource Languages (LRLs) hosted at the #NAACL2024 conference in Mexico City 🇲🇽. Join us at Don Julián, either in person or virtually, on Wednesday, June 18th, from 2-3 pm (Central Standard Time) as we delve into the profound significance of #LRLs.
LINK: https://lnkd.in/gWXxQT5x
Organizers:
Anne En-Shiun Lee - assistant professor (status only) at the Department of Computer Science, University of Toronto, and assistant professor at Ontario Tech University
E. David Guzmán Ramírez - MScAC alumnus, University of Toronto
Eric Khiu - Undergraduate Math & Computer Science student at University of Michigan, NLP Research Fellow at University of Toronto
Aditya Khan - Undergraduate Data Science & Statistics student, University of Toronto
Mason Shipton - Undergraduate Computer Science student, Ontario Tech University
#birdsofafeather #event #naacl2024 #LRL #language #nlp #community #machinelearning #research #ai
🌎 Calling all language enthusiasts and professionals! Prepare to embark on an extraordinary journey at the highly anticipated Birds of a Feather event on Low Resource Languages (LRLs) hosted at the #NAACL2024 conference in Mexico City 🇲🇽. Join us at Don Julián, either in person or virtually, on Wednesday, June 18th, from 2-3 pm (Central Standard Time) as we delve into the profound significance of #LRLs.

In our interconnected world, where communication bridges gaps and fosters understanding, it is crucial to recognize and celebrate the diverse linguistic tapestry that enriches our #global #community. LRLs, spoken by marginalized communities, carry immense cultural, historical, and social value. By shining a spotlight on LRLs, we aim to raise awareness, address the challenges faced by these languages, and explore innovative solutions. Together, let's embrace the power of #inclusivity, empower #underrepresented #languages, and forge a path toward a more #equitable #linguistic landscape 🗣 💡 ✨ Don't miss out on this opportunity to connect, learn, and be inspired at the Birds of a Feather event on Low Resource Languages!

Link: https://2024.naacl.org/

Organizers:
Anne En-Shiun Lee - assistant professor (status only) at the Department of Computer Science, University of Toronto, and, since fall 2023, assistant professor at Ontario Tech University. She received her Ph.D. from the University of Waterloo's Centre for Pattern Analysis and Machine Intelligence. She was a research scientist at VerticalScope Inc. and Stradigi AI, as well as a visiting researcher at the Fields Institute and The Chinese University of Hong Kong. Professor Lee's passion is finding patterns in #society and in #nature; notably, she developed unsupervised pattern-based algorithms applied to both #biosequences and #text.
E. David Guzmán Ramírez - MScAC alumnus, University of Toronto
Eric Khiu - Undergraduate Math & Computer Science student at University of Michigan, NLP Research Fellow at University of Toronto
Aditya Khan - Undergraduate Data Science & Statistics student, University of Toronto
Mason Shipton - Undergraduate Computer Science student, Ontario Tech University

#birdsofafeather #event #naacl2024 #LRL #language #nlp #community #machinelearning #research #ai
NAACL has officially started! We hope to see you at our two posters.

Tuesday, TBD (3:30-5 p.m.): A Reproducibility Study on Quantifying Language Similarity: The Impact of Missing Values in the URIEL Knowledge Base, by Hasti Toossi, Eric Khiu, Brady Huai, Jinyu L., A. Seza Dogruoz

Wednesday, 11 a.m.-12:30 p.m.: Unlocking Parameter-Efficient Fine-Tuning for Low-Resource Language Translation, by Tong Su, Xin Peng, E. David Guzmán Ramírez, Sarubi Thillainathan, Surangika Ranathunga

Congrats to all the junior researchers on their hard work, and thank you to all the collaborators for your support in training the next generation of brilliant minds.
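For anyone new to parameter-efficient fine-tuning, a minimal sketch of one common approach, LoRA adapters via the Hugging Face peft library, is below. The base checkpoint, ranks, and target modules are illustrative assumptions, not the configuration studied in the poster.

```python
# Illustrative LoRA setup for a seq2seq translation model.
# Checkpoint, rank, alpha, and target modules are assumptions for this sketch.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "facebook/nllb-200-distilled-600M"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                                  # low-rank dimension of the adapters
    lora_alpha=32,                         # scaling applied to adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

The appeal for low-resource translation is that only the adapter weights are updated, which keeps memory and compute requirements far below full fine-tuning.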
Names do Matter! Wu and Ge’s research found that people with complex names had a 10% lower chance of getting an academic job over the next year. According to the paper’s abstract, “analysis on two recent cohorts of economics Ph.D. job candidates shows that those with difficult-to-pronounce names are less likely to obtain an academic or tenure-track position and are placed at institutions with lower research productivity.” https://lnkd.in/gFXP66Qt
Way to go, team!
I'm very happy to share our new benchmark, IrokoBench, a human-translated benchmark dataset for 16 African languages covering:
- natural language inference (AfriXNLI)
- maths reasoning (AfriMGSM)
- multiple-choice QA (AfriMMLU)

Paper: https://lnkd.in/e_arMaYg
Data: https://lnkd.in/eFwkXYfj
Project funded by Lacuna Fund with Masakhane NLP.

In total the benchmark covers 18 languages: 16 native African languages (amh, ewe, hau, ibo, kin, lin, lug, orm, sna, sot, swa, twi, wol, xho, yor & zul) and two European languages (eng and fra), translated from the MGSM, MMLU and XNLI datasets.

We provide zero-/few-shot evaluations of 14 LLMs (open-weight and closed models) in two settings, leveraging lm-eval: 1) in-language and 2) translate-test (where test sets were automatically translated to English using NLLB-200-3B).

GPT-4o is the best model across all tasks for the native African languages; however, its performance is worse for eng and fra, where GPT-4-Turbo is more than +9.0 points better. Aya-101 is the best open model, but in the translate-test setting LLaMA 3 70B is better, since it is more English-centric. In few-shot evaluation, LLaMA 3 70B benefits significantly from few-shot examples on AfriMMLU and AfriXNLI but not on AfriMGSM, since it can only reason effectively about maths in English. GPT-4o consistently improves with additional few-shot examples.

This is a great collaboration with many authors: Jessica Ojo, Israel Abebe Azime, Jesujoba Alabi, Millicent Ochieng, Sara Hooker, Andiswa Bukula, Annie Lee, Happy Buzaaba, Blessing Sibanda, Jonathan Mukiibi, Salomey Osei, Salomon Kabongo KABENAMUALU, Foutse Yuehgoh, Rooweither Mabuya, Shamsuddeen Hassan Muhammad, PhD, sokhar samb, Mmasibidi Setaka, Lolwethu Ndolela, Nkiruka Odu, Tadesse Kebede, Xuanli He, Pontus Stenetorp.

A big thank you to OpenAI, Cohere For AI and Oracle for the compute/API credits.
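For those wondering what the translate-test step looks like in practice, here is a minimal sketch using the Hugging Face transformers API. The distilled checkpoint, the Yoruba source code yor_Latn, and the generation settings are illustrative assumptions; the post states the actual runs used NLLB-200-3B.

```python
# Sketch of the translate-test idea: machine-translate test items to English
# before evaluating an English-centric LLM. Checkpoint and language codes are
# illustrative; the actual IrokoBench runs used NLLB-200-3B.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # smaller stand-in for NLLB-200-3B
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="yor_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def translate_to_english(sentence: str) -> str:
    """Translate one Yoruba test sentence into English with NLLB-200."""
    inputs = tokenizer(sentence, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # Force the decoder to start in the target language (English, Latin script).
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# The translated items would then be scored with an English-centric LLM,
# e.g. via lm-eval, just as in the in-language setting but on English text.
```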
CAIO at SpassMed
I found that Llama 3 is amazing and performed better than expected. Cheers, and I look forward to hearing more from you, Annie Lee 🙏