Medical Concept Normalization

Part of the book series: Cognitive Informatics in Biomedicine and Healthcare (CIBH)

Abstract

Medical concept normalization, which maps clinical entities to concepts in standard terminologies, is essential for supporting downstream computational applications in clinical settings. This chapter starts with an overview of existing biomedical terminologies and ontologies and explains their roles within biomedical NLP systems. It then surveys medical concept normalization approaches, from traditional rule-based methods to contemporary machine learning and deep learning techniques. Finally, the chapter presents available resources, including shared tasks and annotated corpora designed for concept normalization, to support readers working in this area.

Author information

Authors and Affiliations

Yale University, 100 College St, New Haven, CT, 06510, USA
Hua Xu, Na Hong & Kalpana Raja
National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, 20894, USA
Dina Demner Fushman

Authors

Hua Xu
Dina Demner Fushman
Na Hong
Kalpana Raja

Corresponding author

Correspondence to Hua Xu.

Editor information

Editors and Affiliations

Yale University, New Haven, CT, USA
Hua Xu
United States National Library of Medicine, Bethesda, MD, USA
Dina Demner Fushman

Glossary

Medical concept normalization: A process in healthcare informatics and natural language processing (NLP) that maps medical terms mentioned in free text to standardized concept IDs (or codes) in controlled medical terminologies or ontologies.
Common data model (CDM): A standardized, structured, and unified way of organizing and representing data from diverse sources in a consistent format.
Controlled vocabulary: A finite, enumerated set of terms intended to convey information unambiguously.
Observational Health Data Sciences and Informatics (OHDSI): A multi-stakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics.
Terminology: A set of terms representing the system of concepts of a particular subject field.
Ontology: A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. In biomedicine, such ontologies typically specify the meanings and hierarchical relationships among terms and concepts in a domain.
Interoperability: The ability of different systems, applications, or components to seamlessly communicate, exchange data, and work together effectively.
Semantic similarity: A measure of how similar or related the meanings of two words, phrases, sentences, or documents are.
Entity linking: A natural language processing (NLP) step that, after a named entity has been found in a document, links (normalizes) that entity to the appropriate entry in a database. Medical concept normalization is a special case of entity linking.
Lexical variation: A linguistic phenomenon in which different contexts use different words or expressions to refer to the same concept.
Polysemy: A linguistic phenomenon where a single word or phrase has multiple related meanings or senses.
Granularity: The level of detail or specificity at which meaning is represented in language or knowledge.
Unified Medical Language System (UMLS): A terminology system, developed under the direction of the National Library of Medicine, to produce a common structure that ties together the various vocabularies that have been created for biomedical domains.
URI/Identifier: A uniform resource identifier (URI) is a string that identifies a resource. A URI may act as a locator (URL), a name (URN), or both, and is intended to provide persistent access to digital objects.
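
To make the glossary terms concrete, the following is a minimal sketch of entity linking by lexical similarity, written for illustration only and not taken from the chapter. The terminology, its concept IDs (C001 to C003), the function names, and the 0.5 threshold are all hypothetical. Each mention is compared with every concept name using cosine similarity over character trigrams, which tolerates some lexical variation such as case, punctuation, and small misspellings.

```python
import re
from collections import Counter
from math import sqrt

# Toy terminology. The concept IDs and synonym lists are illustrative
# placeholders, not codes from any real terminology.
TERMINOLOGY = {
    "C001": ["myocardial infarction", "heart attack"],
    "C002": ["hypertension", "high blood pressure"],
    "C003": ["diabetes mellitus", "diabetes"],
}


def normalize_text(text):
    """Lowercase the text and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def char_ngrams(text, n=3):
    """Count character n-grams of the normalized, space-padded text."""
    padded = f" {normalize_text(text)} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def link(mention, threshold=0.5):
    """Return (concept_id, score) for the best-matching concept name.

    The concept ID is None when no name reaches the (assumed) threshold.
    """
    query = char_ngrams(mention)
    best_id, best_score = None, 0.0
    for concept_id, names in TERMINOLOGY.items():
        for name in names:
            score = cosine(query, char_ngrams(name))
            if score > best_score:
                best_id, best_score = concept_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


if __name__ == "__main__":
    for mention in ["Heart Attack", "high blood-pressure", "hypertensoin", "fracture"]:
        print(mention, "->", link(mention))
```

A purely lexical matcher like this cannot resolve polysemy or pick the right level of granularity, because it ignores the context around the mention. Handling those cases would need context-aware methods of the machine learning and deep learning kind that the abstract mentions.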

About this chapter

Cite this chapter

Xu, H., Demner Fushman, D., Hong, N., Raja, K. (2024). Medical Concept Normalization. In: Xu, H., Demner Fushman, D. (eds) Natural Language Processing in Biomedicine. Cognitive Informatics in Biomedicine and Healthcare. Springer, Cham. https://doi.org/10.1007/978-3-031-55865-8_6

DOI: https://doi.org/10.1007/978-3-031-55865-8_6
Published: 09 June 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-55864-1
Online ISBN: 978-3-031-55865-8
eBook Packages: Medicine (R0)