Showing 1–32 of 32 results for author: Adewumi, T

  1. arXiv:2406.19097  [pdf, other]

    cs.CL

    Fairness and Bias in Multimodal AI: A Survey

    Authors: Tosin Adewumi, Lama Alkhaled, Namrata Gurung, Goya van Boven, Irene Pagliai

    Abstract: The importance of addressing fairness and bias in artificial intelligence (AI) systems cannot be over-emphasized. Mainstream media has been awashed with news of incidents around stereotypes and bias in many of these systems in recent years. In this survey, we fill a gap with regards to the minimal study of fairness and bias in Large Multimodal Models (LMMs) compared to Large Language Models (LLMs)…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 8 pages

  2. arXiv:2406.11727  [pdf, ps, other]

    eess.AS cs.CL

    1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis

    Authors: Sewade Ogun, Abraham T. Owodunni, Tobi Olatunji, Eniola Alese, Babatunde Oladimeji, Tejumade Afonja, Kayode Olaleye, Naome A. Etori, Tosin Adewumi

    Abstract: Recent advances in speech synthesis have enabled many useful applications like audio directions in Google Maps, screen readers, and automated content generation on platforms like TikTok. However, these systems are mostly dominated by voices sourced from data-rich geographies with personas representative of their source data. Although 3000 of the world's languages are domiciled in Africa, African v…

    Submitted 27 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  3. arXiv:2404.04838  [pdf, other]

    cs.CL

    Data Bias According to Bipol: Men are Naturally Right and It is the Role of Women to Follow Their Lead

    Authors: Irene Pagliai, Goya van Boven, Tosin Adewumi, Lama Alkhaled, Namrata Gurung, Isabella Södergren, Elisa Barney

    Abstract: We introduce new large labeled datasets on bias in 3 languages and show in experiments that bias exists in all 10 datasets of 5 languages evaluated, including benchmark datasets on the English GLUE/SuperGLUE leaderboards. The 3 new languages give a total of almost 6 million labeled samples and we benchmark on these datasets using SotA multilingual pretrained models: mT5 and mBERT. The challenge of…

    Submitted 7 April, 2024; originally announced April 2024.

    Comments: 11 pages, 6 figures

  4. arXiv:2404.04631  [pdf, other]

    cs.CL

    On the Limitations of Large Language Models (LLMs): False Attribution

    Authors: Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney

    Abstract: In this work, we provide insight into one important limitation of large language models (LLMs), i.e. false attribution, and introduce a new hallucination metric - Simple Hallucination Index (SHI). The task of automatic author attribution for relatively small chunks of text is an important NLP task but can be challenging. We empirically evaluate the power of 3 open SotA LLMs in zero-shot setting (L…

    Submitted 6 April, 2024; originally announced April 2024.

    Comments: 8 pages, 5 figures

  5. arXiv:2404.03486  [pdf, other]

    cs.CL

    Generative AI and Teachers -- For Us or Against Us? A Case Study

    Authors: Jenny Pettersson, Elias Hult, Tim Eriksson, Tosin Adewumi

    Abstract: We present insightful results of a survey on the adoption of generative artificial intelligence (GenAI) by university teachers in their teaching activities. The transformation of education by GenAI, particularly large language models (LLMs), has been presenting both opportunities and challenges, including cheating by students. We prepared the online survey according to best practices and the quest…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: 7 pages, 3 figures

  6. arXiv:2404.00399  [pdf, other]

    cs.CL cs.AI cs.LG

    Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order

    Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, et al. (20 additional authors not shown)

    Abstract: Pretrained language models underpin several AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. However, such existing models face challenges: limited multilingual capabilities, continual pretraining causing catastrophic forgetting, where…

    Submitted 23 April, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

    Comments: Preprint

  7. arXiv:2402.00453  [pdf, other]

    cs.CV cs.CL

    Instruction Makes a Difference

    Authors: Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney

    Abstract: We introduce Instruction Document Visual Question Answering (iDocVQA) dataset and Large Language Document (LLaDoc) model, for training Language-Vision (LV) models for document analysis and predictions on document images, respectively. Usually, deep neural networks for the DocVQA task are trained on datasets lacking instructions. We show that using instruction-following datasets improves performanc…

    Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

    Comments: Accepted at the 16th IAPR International Workshop On Document Analysis Systems (DAS)

  8. arXiv:2312.09801  [pdf, other]

    cs.CL

    ProCoT: Stimulating Critical Thinking and Writing of Students through Engagement with Large Language Models (LLMs)

    Authors: Tosin Adewumi, Lama Alkhaled, Claudia Buck, Sergio Hernandez, Saga Brilioth, Mkpe Kekung, Yelvin Ragimov, Elisa Barney

    Abstract: We introduce a novel writing method called Probing Chain-of-Thought (ProCoT), which potentially prevents students from cheating using a Large Language Model (LLM), such as ChatGPT, while enhancing their active learning. LLMs have disrupted education and many other fields. For fear of students cheating, many have resorted to banning their use. These LLMs are also known for hallucinations. We conduc…

    Submitted 1 May, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: 8 pages, 4 figures

  9. arXiv:2311.09828  [pdf, other]

    cs.CL

    AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

    Authors: Jiayi Wang, David Ifeoluwa Adelani, Sweta Agrawal, Marek Masiak, Ricardo Rei, Eleftheria Briakou, Marine Carpuat, Xuanli He, Sofia Bourhim, Andiswa Bukula, Muhidin Mohamed, Temitayo Olatoye, Tosin Adewumi, Hamam Mokayed, Christine Mwase, Wangui Kimotho, Foutse Yuehgoh, Anuoluwapo Aremu, Jessica Ojo, Shamsuddeen Hassan Muhammad, Salomey Osei, Abdul-Hakeem Omotayo, Chiamaka Chukwuneke, Perez Ogayo, Oumaima Hourrane, et al. (33 additional authors not shown)

    Abstract: Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of eval…

    Submitted 23 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

    Comments: Accepted by NAACL 2024

  10. arXiv:2304.12847  [pdf, other]

    cs.CL

    NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset

    Authors: Sana Sabah Al-Azzawi, György Kovács, Filip Nilsson, Tosin Adewumi, Marcus Liwicki

    Abstract: In this paper, we propose a methodology for task 10 of SemEval23, focusing on detecting and classifying online sexism in social media posts. The task is tackling a serious issue, as detecting harmful content on social media platforms is crucial for mitigating the harm of these posts on users. Our solution for this task is based on an ensemble of fine-tuned transformer-based models (BERTweet, RoBER…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: 6 pages, 5 figures. This paper has been accepted at the SemEval workshop at the ACL 2023 conference

  11. arXiv:2304.06459  [pdf, other]

    cs.CL cs.AI

    Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages

    Authors: Israel Abebe Azime, Sana Sabah Al-Azzawi, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Jesujoba Alabi, Ayodele Awokoya, Mardiyyah Oduwole, Tosin Adewumi, Samuel Fanijo, Oyinkansola Awosan, Oreen Yousuf

    Abstract: AfriSenti-SemEval Shared Task 12 of SemEval-2023. The task aims to perform monolingual sentiment classification (sub-task A) for 12 African languages, multilingual sentiment classification (sub-task B), and zero-shot sentiment classification (task C). For sub-task A, we conducted experiments using classical machine learning classifiers, Afro-centric language models, and language-specific models. F…

    Submitted 13 April, 2023; originally announced April 2023.

    Comments: SemEval 2023

  12. arXiv:2304.04029  [pdf, other]

    cs.CL

    Bipol: A Novel Multi-Axes Bias Evaluation Metric with Explainability for NLP

    Authors: Lama Alkhaled, Tosin Adewumi, Sana Sabah Sabry

    Abstract: We introduce bipol, a new metric with explainability, for estimating social bias in text data. Harmful bias is prevalent in many online sources of data that are used for training machine learning (ML) models. In a step to address this challenge we create a novel metric that involves a two-step process: corpus-level evaluation based on model classification and sentence-level evaluation based on (se…

    Submitted 16 September, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

    Comments: Published in Elsevier's Natural Language Processing Journal

  13. arXiv:2303.16985  [pdf, other]

    cs.CL cs.AI

    Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages

    Authors: Colin Leong, Herumb Shandilya, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Joel Mathew, Abdul-Hakeem Omotayo, Oreen Yousuf, Zainab Akinjobi, Chris Chinenye Emezue, Shamsudeen Muhammad, Steven Kolawole, Younwoo Choi, Tosin Adewumi

    Abstract: Many natural language processing (NLP) tasks make use of massively pre-trained language models, which are computationally expensive. However, access to high computational resources added to the issue of data scarcity of African languages constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapt…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: Accepted to AfricaNLP workshop at ICLR2023

  14. arXiv:2301.12139  [pdf, other]

    cs.CL

    Bipol: Multi-axes Evaluation of Bias with Explainability in Benchmark Datasets

    Authors: Tosin Adewumi, Isabella Södergren, Lama Alkhaled, Sana Sabah Sabry, Foteini Liwicki, Marcus Liwicki

    Abstract: We investigate five English NLP benchmark datasets (on the superGLUE leaderboard) and two Swedish datasets for bias, along multiple axes. The datasets are the following: Boolean Question (Boolq), CommitmentBank (CB), Winograd Schema Challenge (WSC), Wino-gender diagnostic (AXg), Recognising Textual Entailment (RTE), Swedish CB, and SWEDN. Bias can be harmful and it is known to be common in data, w…

    Submitted 16 September, 2023; v1 submitted 28 January, 2023; originally announced January 2023.

    Comments: Accepted at RANLP 2023

  15. arXiv:2210.12391  [pdf, other]

    cs.CL

    MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

    Authors: David Ifeoluwa Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba O. Alabi, Shamsuddeen H. Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, et al. (20 additional authors not shown)

    Abstract: African languages are spoken by over a billion people, but are underrepresented in NLP research and development. The challenges impeding progress include the limited availability of annotated datasets, as well as a lack of understanding of the settings where current methods are effective. In this paper, we make progress towards solutions for these challenges, focusing on the task of named entity r…

    Submitted 15 November, 2022; v1 submitted 22 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022 (updated Github link)

  16. arXiv:2210.10692  [pdf, ps, other]

    cs.CL

    Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages

    Authors: Idris Abdulmumin, Michael Beukman, Jesujoba O. Alabi, Chris Emezue, Everlyn Asiko, Tosin Adewumi, Shamsuddeen Hassan Muhammad, Mofetoluwa Adeyemi, Oreen Yousuf, Sahib Singh, Tajuddeen Rabiu Gwadabe

    Abstract: We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for the African Languages Shared Task. This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier that was built by fine-tuning a pre-trained language model. To train the classifier, we obtain positive samples (i.e. high-quality parallel sentences) from a gold-standar…

    Submitted 20 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: Accepted at the Seventh Conference on Machine Translation (WMT22)

  17. arXiv:2210.05480  [pdf, other]

    cs.CL

    T5 for Hate Speech, Augmented Data and Ensemble

    Authors: Tosin Adewumi, Sana Sabah Sabry, Nosheen Abid, Foteini Liwicki, Marcus Liwicki

    Abstract: We conduct relatively extensive investigations of automatic hate speech (HS) detection using different state-of-the-art (SoTA) baselines over 11 subtasks of 6 different datasets. Our motivation is to determine which of the recent SoTA models is best for automatic hate speech detection and what advantage methods like data augmentation and ensemble may have on the best model, if any. We carry out 6…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: 15 pages, 18 figures

  18. arXiv:2206.11249  [pdf, other]

    cs.CL cs.AI cs.LG

    GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

    Authors: Sebastian Gehrmann, Abhik Bhattacharjee, Abinaya Mahendiran, Alex Wang, Alexandros Papangelis, Aman Madaan, Angelina McMillan-Major, Anna Shvets, Ashish Upadhyay, Bingsheng Yao, Bryan Wilie, Chandra Bhagavatula, Chaobin You, Craig Thomson, Cristina Garbacea, Dakuo Wang, Daniel Deutsch, Deyi Xiong, Di Jin, Dimitra Gkatzia, Dragomir Radev, Elizabeth Clark, Esin Durmus, Faisal Ladhak, Filip Ginter, et al. (52 additional authors not shown)

    Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, an…

    Submitted 24 June, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

  19. arXiv:2205.03666  [pdf, other]

    cs.CL

    Vector Representations of Idioms in Conversational Systems

    Authors: Tosin Adewumi, Foteini Liwicki, Marcus Liwicki

    Abstract: We demonstrate, in this study, that an open-domain conversational system trained on idioms or figurative language generates more fitting responses to prompts containing idioms. Idioms are part of everyday speech in many languages, across many cultures, but they pose a great challenge for many Natural Language Processing (NLP) systems that involve tasks such as Information Retrieval (IR) and Machin…

    Submitted 7 May, 2022; originally announced May 2022.

    Comments: 7 pages, 1 figure, 8 tables

  20. arXiv:2205.00965  [pdf, other]

    cs.CL

    State-of-the-art in Open-domain Conversational AI: A Survey

    Authors: Tosin Adewumi, Foteini Liwicki, Marcus Liwicki

    Abstract: We survey SoTA open-domain conversational AI models with the purpose of presenting the prevailing challenges that still exist to spur future research. In addition, we provide statistics on the gender of conversational AI in order to guide the ethics discussion surrounding the issue. Open-domain conversational AI are known to have several challenges, including bland responses and performance degrad…

    Submitted 2 May, 2022; originally announced May 2022.

    Comments: 8 pages, 2 figures

  21. arXiv:2204.08083  [pdf, other]

    cs.CL

    AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages

    Authors: Tosin Adewumi, Mofetoluwa Adeyemi, Aremu Anuoluwapo, Bukola Peters, Happy Buzaaba, Oyerinde Samuel, Amina Mardiyyah Rufai, Benjamin Ajibade, Tajudeen Gwadabe, Mory Moussou Koulibaly Traore, Tunde Ajayi, Shamsuddeen Muhammad, Ahmed Baruwa, Paul Owoicho, Tolulope Ogunremi, Phylis Ngigi, Orevaoghene Ahia, Ruqayya Nasir, Foteini Liwicki, Marcus Liwicki

    Abstract: Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. These datasets consist of 1,500 turns…

    Submitted 19 May, 2022; v1 submitted 17 April, 2022; originally announced April 2022.

    Comments: 14 pages, 1 figure, 8 tables

  22. arXiv:2204.07432  [pdf, other]

    cs.CL

    ML_LTU at SemEval-2022 Task 4: T5 Towards Identifying Patronizing and Condescending Language

    Authors: Tosin Adewumi, Lama Alkhaled, Hamam Mokayed, Foteini Liwicki, Marcus Liwicki

    Abstract: This paper describes the system used by the Machine Learning Group of LTU in subtask 1 of the SemEval-2022 Task 4: Patronizing and Condescending Language (PCL) Detection. Our system consists of finetuning a pretrained Text-to-Text-Transfer Transformer (T5) and innovatively reducing its out-of-class predictions. The main contributions of this paper are 1) the description of the implementation detai…

    Submitted 5 May, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

    Comments: Accepted at the International Workshop on Semantic Evaluation (2022) co-located with NAACL

  23. arXiv:2202.05690  [pdf, other]

    cs.CL

    HaT5: Hate Language Identification using Text-to-Text Transfer Transformer

    Authors: Sana Sabah Sabry, Tosin Adewumi, Nosheen Abid, György Kovacs, Foteini Liwicki, Marcus Liwicki

    Abstract: We investigate the performance of a state-of-the art (SoTA) architecture T5 (available on the SuperGLUE) and compare with it 3 other previous SoTA architectures across 5 different tasks from 2 relatively diverse datasets. The datasets are diverse in terms of the number and types of tasks they have. To improve performance, we augment the training data by using an autoregressive model. We achieve ne…

    Submitted 11 February, 2022; originally announced February 2022.

    Comments: 7 pages, 3 figures, conference

    MSC Class: 68

  24. arXiv:2110.06273  [pdf, other]

    cs.CL cs.LG

    Småprat: DialoGPT for Natural Language Generation of Swedish Dialogue by Transfer Learning

    Authors: Tosin Adewumi, Rickard Brännvall, Nosheen Abid, Maryam Pahlavan, Sana Sabah Sabry, Foteini Liwicki, Marcus Liwicki

    Abstract: Building open-domain conversational systems (or chatbots) that produce convincing responses is a recognized challenge. Recent state-of-the-art (SoTA) transformer-based models for the generation of natural language dialogue have demonstrated impressive performance in simulating human-like, single-turn conversations in English. This work investigates, by an empirical study, the potential for transfe…

    Submitted 13 February, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Presented at Northern Lights Deep Learning Conference (NLDL) 2022, Tromso, Norway

  25. arXiv:2105.03280  [pdf, other]

    cs.CL cs.LG

    Potential Idiomatic Expression (PIE)-English: Corpus for Classes of Idioms

    Authors: Tosin P. Adewumi, Roshanak Vadoodi, Aparajita Tripathy, Konstantina Nikolaidou, Foteini Liwicki, Marcus Liwicki

    Abstract: We present a fairly large, Potential Idiomatic Expression (PIE) dataset for Natural Language Processing (NLP) in English. The challenges with NLP systems with regards to tasks such as Machine Translation (MT), word sense disambiguation (WSD) and information retrieval make it imperative to have a labelled idioms dataset with classes such as it is in this work. To the best of the authors' knowledge,…

    Submitted 23 April, 2022; v1 submitted 25 April, 2021; originally announced May 2021.

    Comments: Accepted at the International Conference on Language Resources and Evaluation (LREC) 2022

  26. arXiv:2103.11811  [pdf]

    cs.CL cs.AI

    MasakhaNER: Named Entity Recognition for African Languages

    Authors: David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, et al. (36 additional authors not shown)

    Abstract: We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We…

    Submitted 5 July, 2021; v1 submitted 22 March, 2021; originally announced March 2021.

    Comments: Accepted to TACL 2021, pre-MIT Press publication version

  27. arXiv:2102.01672  [pdf, other]

    cs.CL cs.AI cs.LG

    The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

    Authors: Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Aremu Anuoluwapo, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna Clinciu, Dipanjan Das, Kaustubh D. Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, et al. (31 additional authors not shown)

    Abstract: We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it…

    Submitted 1 April, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

  28. arXiv:2011.07605  [pdf, ps, other]

    cs.CL cs.LG

    The Challenge of Diacritics in Yoruba Embeddings

    Authors: Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki

    Abstract: The major contributions of this work include the empirical establishment of a better performance for Yoruba embeddings from undiacritized (normalized) dataset and provision of new analogy sets for evaluation. The Yoruba language, being a tonal language, utilizes diacritics (tonal marks) in written form. We show that this affects embedding performance by creating embeddings from exactly the same Wi…

    Submitted 15 November, 2020; originally announced November 2020.

    Comments: Presented at NeurIPS 2020 Workshop on Machine Learning for the Developing World

  29. arXiv:2011.03281  [pdf, other]

    cs.CL cs.LG

    Corpora Compared: The Case of the Swedish Gigaword & Wikipedia Corpora

    Authors: Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki

    Abstract: In this work, we show that the difference in performance of embeddings from differently sourced data for a given language can be due to other factors besides data size. Natural language processing (NLP) tasks usually perform better with embeddings from bigger corpora. However, broadness of covered domain and noise can play important roles. We evaluate embeddings based on two Swedish corpora: The G…

    Submitted 6 November, 2020; originally announced November 2020.

    Comments: Presented at the Eighth Swedish Language Technology Conference (SLTC)

  30. arXiv:2007.16007  [pdf, other]

    cs.CL cs.LG

    Exploring Swedish & English fastText Embeddings for NER with the Transformer

    Authors: Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki

    Abstract: In this paper, our main contributions are that embeddings from relatively smaller corpora can outperform ones from larger corpora and we make the new Swedish analogy test set publicly available. To achieve a good network performance in natural language processing (NLP) downstream tasks, several factors play important roles: dataset size, the right hyper-parameters, and well-trained embeddings. We…

    Submitted 17 April, 2021; v1 submitted 23 July, 2020; originally announced July 2020.

    Comments: 11 pages, 2 figures, 8 tables; added new references and clarification about other possible models for NER

  31. arXiv:2003.11645  [pdf, other]

    cs.CL cs.LG stat.ML

    Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks

    Authors: Tosin P. Adewumi, Foteini Liwicki, Marcus Liwicki

    Abstract: Word2Vec is a prominent model for natural language processing (NLP) tasks. Similar inspiration is found in distributed embeddings for new state-of-the-art (SotA) deep neural networks. However, wrong combination of hyper-parameters can produce poor quality vectors. The objective of this work is to empirically show optimal combination of hyper-parameters exists and evaluate various combinations. We…

    Submitted 17 April, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

    Comments: 8 pages, 7 figures, 6 tables; added new references based on new input in the result section about CI

  32. Inner For-Loop for Speeding Up Blockchain Mining

    Authors: Tosin P. Adewumi, Marcus Liwicki

    Abstract: In this paper, the authors propose to increase the efficiency of blockchain mining by using a population-based approach. Blockchain relies on solving difficult mathematical problems as proof-of-work within a network before blocks are added to the chain. Brute force approach, advocated by some as the fastest algorithm for solving partial hash collisions and implemented in Bitcoin blockchain, implie…

    Submitted 26 February, 2020; v1 submitted 6 June, 2019; originally announced June 2019.

    Comments: 6 pages, 1 table and 2 figures

    Journal ref: Open Computer Science, 10(1), pp. 42-47 (2020)