Unit testing a PDF document in JavaScript or TypeScript
PDF is a complex format. Parts of a PDF file are plain text, so you can open it in any text editor, but the result is hard to read: there may be numbers between the strings and other seemingly random elements (they are not random, though). How, then, can we check whether a PDF contains a specific string? pdf.js-extract comes to the rescue.
Extraction
Let’s install the required packages first.
    npm install pdf.js-extract chai --save-dev
For the assertions, we will use chai, but you can use whichever library you like.
To get the text out of a PDF, we have to load it first.
    import {PDFExtract, PDFExtractPage} from 'pdf.js-extract';
    import {expect} from 'chai';

    // filePath is the path to the PDF under test
    const pdfStructure = await new PDFExtract().extract(filePath, {});
To access the content, go through the pages property, an array of PDFExtractPage objects. Each PDFExtractPage has a content property, an array of PDFExtractText objects, which lets us search for the desired text.
Going through multiple arrays can be tedious: we would first have to iterate over pages, then over content, and finally compare each item against the string we are looking for, or against an entire array of strings.
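To make that concrete, here is a minimal sketch of the nested lookup. The ExtractedPage and ExtractedText interfaces are simplified stand-ins that model only the content and str fields mentioned above, the sample data is hand-written, and documentContains is a hypothetical helper name, not part of pdf.js-extract.

```typescript
// Simplified stand-ins for the library's page and text types.
// Assumption: only the fields used in this article are modelled.
interface ExtractedText { str: string; }
interface ExtractedPage { content: ExtractedText[]; }

// Hypothetical helper: true if any single text item on any page contains `needle`.
function documentContains(pages: ExtractedPage[], needle: string): boolean {
  for (const page of pages) {
    for (const item of page.content) {
      if (item.str.indexOf(needle) !== -1) {
        return true;
      }
    }
  }
  return false;
}

// Hand-written sample data standing in for an extraction result.
const samplePages: ExtractedPage[] = [
  { content: [{ str: 'First text' }, { str: 'Second text' }] },
  { content: [{ str: 'Third text' }] },
];

console.log(documentContains(samplePages, 'Third'));  // true
console.log(documentContains(samplePages, 'Fourth')); // false
```

Note that this version only matches text that sits inside a single item; a phrase split across two items would be missed.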
Let’s make a helper function that merges all of the strings in the content property into one string, so we can check for specific substrings.
    function textOf(page: PDFExtractPage): string {
      return page.content.map(({str}) => str).join(' ');
    }
Now we can check for specific strings in a page.
    const pageText = textOf(pdfStructure.pages[0]);
    expect(pageText).to.include('First text');
    expect(pageText).to.include('Second text');
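If you do not know in advance which page holds the text, the same idea extends to the whole document. The following is only a sketch under assumptions: PageLike and TextChunk are simplified stand-ins for the library types, joinedText restates textOf so the snippet runs on its own, and pagesContaining is a hypothetical helper name.

```typescript
// Simplified stand-ins for the library's types (assumption: only `content`
// and `str` are needed here).
interface TextChunk { str: string; }
interface PageLike { content: TextChunk[]; }

// Same idea as textOf: merge all strings of a page into one.
function joinedText(page: PageLike): string {
  return page.content.map(({ str }) => str).join(' ');
}

// Hypothetical helper: zero-based indices of pages whose merged text
// contains `needle`.
function pagesContaining(pages: PageLike[], needle: string): number[] {
  const hits: number[] = [];
  pages.forEach((page, index) => {
    if (joinedText(page).indexOf(needle) !== -1) {
      hits.push(index);
    }
  });
  return hits;
}

// Hand-written sample data standing in for an extraction result.
const demoPages: PageLike[] = [
  { content: [{ str: 'First' }, { str: 'text' }] },
  { content: [{ str: 'Second text' }] },
  { content: [{ str: 'Another' }, { str: 'First text' }] },
];

console.log(pagesContaining(demoPages, 'First text')); // [ 0, 2 ]
```

Because the strings are joined with a space, the first sample page matches even though "First" and "text" are separate items. In a chai test, the result could then be compared with an expected list of page indices.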
Summary
pdf.js-extract, while fairly simple in its current form, lets us verify a PDF document without having to deal with the complexity of the PDF format itself. Currently, it only supports text and its metadata (position, size, etc.) and is, unfortunately, quite slow: even the smallest documents can take a second to process. Fortunately, this is less of an issue in unit tests than in production code. It will be interesting to see how pdf.js-extract develops over time.
Feedback is welcome!