Showing 1–1 of 1 results for author: Cosillo, M G

  1. arXiv:2309.11549  [pdf, other]

    cs.DL astro-ph.IM

    Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

    Authors: Jill P. Naiman, Morgan G. Cosillo, Peter K. G. Williams, Alyssa Goodman

    Abstract: Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: 6 pages, 1 figure, 1 table; training/validation/test datasets and all model weights to be linked on Zenodo on publication