Nougat: Neural Optical Understanding for Academic Documents Lukas Blecher* Guillem Cucurull Thomas Scialom Robert Stojnic Meta AI Abstract Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat ( Neural Optical Understanding for Academic Documents), a Visual rXiv:2308.13418v1 [cs.LG] 25 Aug 2023 Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human- readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition. Introduction 1 The majority of scientific knowledge is stored in books or published in scientific journals, most commonly in the Portable Document Format (PDF). Next to HTML, PDFs are the second most prominent data format on the internet, making up 2.4% of common crawl [ 1 ]. However, the information stored in these files is very difficult to extract into any other formats. This is especially true for highly specialized documents, such as scientific research papers, where the semantic information of mathematical expressions is lost. Existing Optical Character Recognition (OCR) engines, such as Tesseract OCR [ 2 ], excel at detecting and classifying individual characters and words in an image, but fail to understand the relationship between them due to their line-by-line approach. This means that they treat superscripts and subscripts in the same way as the surrounding text, which is a significant drawback for mathematical expressions. In mathematical notations like fractions, exponents, and matrices, relative positions of characters are crucial. Converting academic research papers into machine-readable text also enables accessibility and searchability of science as a whole. The information of millions of academic papers can not be fully accessed because they are locked behind an unreadable format. Existing corpora, such as the S2ORC dataset [ 3 ], capture the text of 12M 2 papers using GROBID [ 4 ], but are missing meaningful representations of the mathematical equations. To this end, we introduce Nougat, a transformer based model that can convert images of document pages to formatted markup text. The primary contributions in this paper are • Release of a pre-trained model capable of converting a PDF to a lightweight markup language. We release the code and the model on GitHub 3 • We introduce a pipeline to create dataset for pairing PDFs to source code • Our method is only dependent on the image of a page, allowing access to scanned papers and books ° Correspondence to: Iblecher@meta.com 2 The paper reports 8.1M papers but the authors recently updated the numbers on the GitHub page https://github.com/allenai/s2orc 3 https://github.com/facebookresearch/nougat