Abstract
Medical concept normalization, which maps clinical entities to concepts in a standard terminology, is essential for downstream computational applications in clinical settings. This chapter begins with an overview of existing biomedical terminologies and ontologies and explains their roles in biomedical NLP systems. It then surveys medical concept normalization approaches, from traditional rule-based methods to contemporary machine learning and deep learning techniques. Finally, the chapter lists available resources, including shared tasks and annotated corpora for concept normalization, to support readers working in this area.
Glossary
- Medical concept normalization
  A process in healthcare informatics and natural language processing (NLP) that maps medical terms mentioned in free text to standardized concept IDs (or codes) in controlled medical terminologies or ontologies.
- Common data model (CDM)
  A standardized, structured, and unified way of organizing and representing data from diverse sources in a consistent format.
- Controlled vocabularies
  A finite, enumerated set of terms intended to convey information unambiguously.
- Observational Health Data Sciences and Informatics (OHDSI)
  A multi-stakeholder, interdisciplinary collaborative that aims to bring out the value of health data through large-scale analytics.
- Terminology
  A set of terms representing the system of concepts of a particular subject field.
- Ontology
  A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. In biomedicine, such ontologies typically specify the meanings and hierarchical relationships among terms and concepts in a domain.
- Interoperability
  The ability of different systems, applications, or components to seamlessly communicate, exchange data, and work together effectively.
- Semantic similarity
  A measure of how similar or related the meanings of two words, phrases, sentences, or documents are.
- Entity linking
  A natural language processing (NLP) step that, after a named entity has been found in a document, links (normalizes) that entity to the appropriate entry in a database. Medical concept normalization is a special case of entity linking.
- Lexical variation
  A linguistic phenomenon in which different words or expressions are used in different contexts to refer to the same concept.
- Polysemy
  A linguistic phenomenon in which a single word or phrase has multiple related meanings or senses.
- Granularity
  The level of detail or specificity at which meaning is represented in language or knowledge.
- Unified Medical Language System (UMLS)
  A terminology system, developed under the direction of the National Library of Medicine, that provides a common structure tying together the various vocabularies created for biomedical domains.
- URI/Identifier
  A uniform resource identifier (URI) is a string that identifies a resource; the term covers both uniform resource names (URNs) and uniform resource locators (URLs). URIs are intended to provide persistent access to digital objects.
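As a rough illustration of how several of the glossary terms relate (medical concept normalization, controlled vocabularies, and lexical variation), the following minimal Python sketch maps free-text mentions to concept identifiers. It is not a method described in the chapter: the vocabulary, the `TOY:` identifiers, and the function name `normalize_mention` are hypothetical, and approximate string matching from the standard library stands in for the more capable rule-based and learned approaches the chapter surveys.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical toy vocabulary. The identifiers are made up for illustration
# and do not come from UMLS, SNOMED CT, or any other real terminology.
TOY_VOCABULARY = {
    "myocardial infarction": "TOY:0001",
    "heart attack": "TOY:0001",          # lexical variant of the same concept
    "hypertension": "TOY:0002",
    "high blood pressure": "TOY:0002",   # lexical variant of the same concept
    "diabetes mellitus": "TOY:0003",
}


def normalize_mention(mention: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the toy concept ID for a mention, or None if no term is close enough."""
    # Lowercase and collapse whitespace so trivial surface differences do not matter.
    key = " ".join(mention.lower().split())

    # Step 1: exact dictionary lookup.
    if key in TOY_VOCABULARY:
        return TOY_VOCABULARY[key]

    # Step 2: approximate string matching to absorb small spelling variation.
    candidates = get_close_matches(key, list(TOY_VOCABULARY), n=1, cutoff=cutoff)
    return TOY_VOCABULARY[candidates[0]] if candidates else None


if __name__ == "__main__":
    for text in ["Heart  Attack", "hypertention", "broken leg"]:
        print(text, "->", normalize_mention(text))
```

In this sketch, "heart attack" and "myocardial infarction" resolve to the same identifier because both surface forms are listed as synonyms. String matching alone cannot recognize an unseen synonym or resolve polysemy, which is one reason context-aware methods are used in practice.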
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Xu, H., Demner Fushman, D., Hong, N., Raja, K. (2024). Medical Concept Normalization. In: Xu, H., Demner Fushman, D. (eds) Natural Language Processing in Biomedicine. Cognitive Informatics in Biomedicine and Healthcare. Springer, Cham. https://doi.org/10.1007/978-3-031-55865-8_6
DOI: https://doi.org/10.1007/978-3-031-55865-8_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-55864-1
Online ISBN: 978-3-031-55865-8
eBook Packages: Medicine, Medicine (R0)