Unit testing a PDF document in JavaScript or TypeScript

PDF is a very complex format. You can open a PDF file in any text editor, but what you see is hard to read: the text is often stored in compressed streams, and even when it is not, the strings are interleaved with numbers and other seemingly random elements (they are not random, though). How can we check whether a PDF contains a specific string, then? pdf.js-extract comes to the rescue.

Extraction

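Let’s install the required packages first: pdf.js-extract for the extraction itself and chai for the assertions (for example, npm install --save-dev pdf.js-extract chai).

Chai is only the choice made here; you can use whichever testing framework you like.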

To get text out of a PDF, we have to load it first.
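A minimal sketch of the loading step follows. It assumes that a PDFExtract instance exposes an extract(filename, options) method that returns a promise when no callback is passed. The interfaces are local stand-ins that list only the fields used here, and loadPdf is a hypothetical helper name; the extractor is passed in as a parameter so that the sketch compiles without the package installed.

```typescript
// Local stand-ins for the library's PDFExtractText, PDFExtractPage and
// PDFExtractResult types; only the fields used in this article are listed.
interface TextItem {
  str: string;
}

interface Page {
  content: TextItem[];
}

interface ExtractResult {
  pages: Page[];
}

// Anything with a promise-returning extract(filename, options) method.
// Assumption: `new PDFExtract()` from pdf.js-extract fits this shape.
interface Extractor {
  extract(filename: string, options?: object): Promise<ExtractResult>;
}

// Hypothetical helper: loads a PDF and fails early if no pages came back.
async function loadPdf(extractor: Extractor, filename: string): Promise<ExtractResult> {
  const result = await extractor.extract(filename, {});
  if (result.pages.length === 0) {
    throw new Error(`No pages extracted from ${filename}`);
  }
  return result;
}

// In a real test (requires the package to be installed):
//   import { PDFExtract } from 'pdf.js-extract';
//   const result = await loadPdf(new PDFExtract(), 'document.pdf');
```

Passing the extractor in also makes it easy to substitute a fake one when testing the helpers themselves.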

To access the content, go through the pages property of the extraction result, an array of PDFExtractPage objects. Each PDFExtractPage has a content property, an array of PDFExtractText objects, which holds the text we want to search.
Going through multiple arrays may be tedious, since we would first have to iterate over pages, then over content, and finally compare each item against the string (or the whole array of strings) we are looking for.
Let’s make a helper function that merges all of the strings in the content property into one string, so we can check for specific substrings.
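One possible shape for such a helper is sketched below. The function name pageText and the optional separator parameter are illustrative choices, and the two interfaces are local stand-ins for PDFExtractText and PDFExtractPage that list only the str and content fields.

```typescript
// Local stand-ins for PDFExtractText and PDFExtractPage (assumed shapes).
interface TextItem {
  str: string;
}

interface Page {
  content: TextItem[];
}

// Merges every text item of a page into a single string.
// The separator defaults to '' so that the items are simply concatenated.
function pageText(page: Page, separator: string = ''): string {
  return page.content.map((item) => item.str).join(separator);
}
```

Whether an empty separator or a space works better depends on how the PDF was produced, since some generators split a sentence into many small text items.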

Now we can check for specific strings in a page.
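As a sketch of such a check, the hypothetical function below reports which of the expected strings are missing from a page, so a single assertion can cover a whole list. The interfaces are again local stand-ins, and the chai calls in the comments assume chai's expect style.

```typescript
// Local stand-ins for PDFExtractText and PDFExtractPage (assumed shapes).
interface TextItem {
  str: string;
}

interface Page {
  content: TextItem[];
}

// Returns the expected strings that do NOT occur in the page's merged text.
// An empty result means that every expected string was found.
function missingStrings(page: Page, expected: string[]): string[] {
  const text = page.content.map((item) => item.str).join('');
  return expected.filter((s) => !text.includes(s));
}

// In a chai-based test this could read:
//   expect(missingStrings(result.pages[0], ['Invoice', 'Total'])).to.be.empty;
// or, for a single string, against the merged text of the page:
//   expect(text).to.include('Invoice');
```

Returning the missing strings rather than a boolean makes a failing test say what was not found.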

Summary

pdf.js-extract, while fairly simple in its current form, allows verifying a PDF document without having to deal with the complex nature of the PDF format itself. Currently, it only supports text and its metadata (position, size, etc.) and is, unfortunately, quite slow: even the smallest documents can take a second to process. Fortunately, this is less of an issue in unit tests than in production code. It will be interesting to see how pdf.js-extract develops over time.
