Medical Concept Normalization

  • Chapter
Natural Language Processing in Biomedicine

Part of the book series: Cognitive Informatics in Biomedicine and Healthcare (CIBH)

Abstract

Medical concept normalization, which maps clinical entities to concepts in standard terminologies, is essential for supporting downstream computational applications in clinical settings. This chapter starts with an overview of existing biomedical terminologies and ontologies and explains their roles in biomedical NLP systems. It then surveys approaches to medical concept normalization, from traditional rule-based methods to contemporary machine learning and deep learning techniques. Finally, the chapter lists available resources, including shared tasks and annotated corpora designed for concept normalization, to support readers working in this area of research.

Author information

Corresponding author

Correspondence to Hua Xu.

Glossary

Medical concept normalization

A process in healthcare informatics and natural language processing (NLP) that maps medical terms mentioned in free text to standardized concept IDs (or codes) in controlled medical terminologies or ontologies.

Common data model (CDM)

A standardized, structured, and unified way of organizing and representing data from diverse sources in a consistent format.

Controlled vocabulary

A finite, enumerated set of terms intended to convey information unambiguously.

Observational Health Data Sciences and Informatics (OHDSI)

A multi-stakeholder, interdisciplinary collaborative that aims to bring out the value of health data through large-scale analytics.

Terminology

A set of terms representing the system of concepts of a particular subject field.

Ontology

A description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. In biomedicine, such ontologies typically specify the meanings and hierarchical relationships among terms and concepts in a domain.

Interoperability

The ability of different systems, applications, or components to seamlessly communicate, exchange data, and work together effectively.

Semantic similarity

A measure of how similar or related the meanings of two words, phrases, sentences, or documents are.

Entity linking

A natural language processing (NLP) step that, after a named entity has been found in a document, links (normalizes) that entity to an appropriate entry in a database. Medical concept normalization is a special case of entity linking.

Lexical variation

A linguistic phenomenon in which different contexts use different words or expressions to refer to the same concept.

Polysemy

A linguistic phenomenon in which a single word or phrase has multiple related meanings or senses.

Granularity

The level of detail or specificity at which meaning is represented in language or knowledge.

Unified Medical Language System (UMLS)

A terminology system, developed under the direction of the National Library of Medicine, to produce a common structure that ties together the various vocabularies that have been created for biomedical domains.

URI/Identifier

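A uniform resource identifier (URI) is a string that identifies a resource. The term covers both uniform resource names (URNs) and uniform resource locators (URLs), and a URI is intended to provide persistent access to digital objects.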
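To make several of the glossary terms concrete, the following minimal Python sketch shows how lexical variants of a mention might be normalized to a concept identifier. It is an illustration under stated assumptions, not a method taken from the chapter: the terminology, the `DEMO:` identifiers, the `normalize` function, and the 0.8 threshold are all hypothetical, and character-level string similarity from the standard library stands in for the much richer matching used by real systems.

```python
from difflib import SequenceMatcher

# Hypothetical toy terminology: concept ID -> known synonyms.
# The DEMO: identifiers are placeholders, not codes from any real terminology.
TERMINOLOGY = {
    "DEMO:0001": ["myocardial infarction", "heart attack"],
    "DEMO:0002": ["hypertension", "high blood pressure"],
    "DEMO:0003": ["diabetes mellitus", "diabetes"],
}


def _clean(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def normalize(mention, threshold=0.8):
    """Map a free-text mention to the concept with the most similar synonym.

    Returns (concept_id, score); concept_id is None when the best
    string similarity falls below the (assumed) threshold.
    """
    query = _clean(mention)
    best_id, best_score = None, 0.0
    for concept_id, synonyms in TERMINOLOGY.items():
        for synonym in synonyms:
            # Character-level similarity in [0, 1]; 1.0 means identical strings.
            score = SequenceMatcher(None, query, synonym).ratio()
            if score > best_score:
                best_id, best_score = concept_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


if __name__ == "__main__":
    for mention in ["Heart  Attack", "hypertention", "broken leg"]:
        print(mention, "->", normalize(mention))
```

In this sketch, a synonym list handles lexical variation ("heart attack" and "myocardial infarction" share one identifier), while the similarity threshold tolerates small spelling differences and leaves unmatched mentions unlinked. String similarity alone cannot resolve polysemy; that requires using the surrounding context of the mention.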

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Xu, H., Demner Fushman, D., Hong, N., Raja, K. (2024). Medical Concept Normalization. In: Xu, H., Demner Fushman, D. (eds) Natural Language Processing in Biomedicine. Cognitive Informatics in Biomedicine and Healthcare. Springer, Cham. https://doi.org/10.1007/978-3-031-55865-8_6

  • DOI: https://doi.org/10.1007/978-3-031-55865-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-55864-1

  • Online ISBN: 978-3-031-55865-8

  • eBook Packages: Medicine (R0)
