How extraction works
PDFs store text as a stream of positioned characters: each run of glyphs carries its x/y coordinates on the page, but the file format has no notion of "table cells". This tool reconstructs the table in three passes:
- Read the page text. pdfjs-dist returns every text item with its position and width on the page.
- Group items into rows. Items whose y-coordinates are within a few points of each other are treated as the same row, and the rows are then sorted top-to-bottom.
- Split each row into cells. Within a row we scan left-to-right; if the horizontal gap between the end of one item and the start of the next is wider than ~8 PDF points we start a new cell, otherwise we append the text to the current cell.
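The last two passes can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the `TextItem` shape, the function names, and the 3-point row tolerance are assumptions made for the example (the text above only says "a few points"). With pdfjs-dist, the `x` and `y` values would typically be read from `transform[4]` and `transform[5]` of each text item.

```typescript
// Hypothetical, simplified view of one text item (not the pdfjs-dist type itself).
interface TextItem {
  str: string;
  x: number;     // left edge, in PDF points
  y: number;     // vertical position, in PDF points (PDF y grows upward)
  width: number; // in PDF points
}

const ROW_TOLERANCE = 3; // assumed value for "a few points"
const CELL_GAP = 8;      // the ~8 point gap threshold described above

// Pass 2: group items with similar y into rows, ordered top-to-bottom.
function groupIntoRows(items: TextItem[], tolerance = ROW_TOLERANCE): TextItem[][] {
  // Descending y means top of the page first, since PDF y grows upward.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows: TextItem[][] = [];
  let current: TextItem[] = [];
  let rowY = 0; // y of the first item in the current row
  for (const item of sorted) {
    if (current.length > 0 && Math.abs(rowY - item.y) <= tolerance) {
      current.push(item);
    } else {
      current = [item];
      rows.push(current);
      rowY = item.y;
    }
  }
  return rows;
}

// Pass 3: scan a row left-to-right and start a new cell at each wide gap.
function splitIntoCells(row: TextItem[], gap = CELL_GAP): string[] {
  const sorted = [...row].sort((a, b) => a.x - b.x);
  const cells: string[] = [];
  let text = "";
  let prevRight = 0;
  let started = false;
  for (const item of sorted) {
    if (started && item.x - prevRight > gap) {
      cells.push(text); // gap is wide enough: close the current cell
      text = "";
    }
    text += item.str;   // otherwise keep concatenating into the same cell
    prevRight = item.x + item.width;
    started = true;
  }
  if (started) cells.push(text);
  return cells;
}

function extractTable(items: TextItem[]): string[][] {
  return groupIntoRows(items).map((row) => splitIntoCells(row));
}
```

In this sketch each row is anchored to the y of its first item, so a slowly drifting baseline does not merge neighbouring rows. Both thresholds are parameters, since suitable values likely depend on the font size and layout of the source PDF.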
The result is one .xlsx workbook with either a single sheet (stacked) or one sheet per page (separate), your choice.
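The difference between the two output modes can be illustrated with a small sketch. The `Sheet` type, the `layoutSheets` function, and the sheet names are hypothetical; only the mode names `stacked` and `separate` come from the description above. Writing the sheets to an actual .xlsx file is left out, since that depends on the spreadsheet library in use.

```typescript
type Table = string[][];
type SheetMode = "stacked" | "separate";

// Hypothetical in-memory description of one worksheet.
interface Sheet {
  name: string;
  rows: Table;
}

// pages[i] holds the rows extracted from page i + 1.
function layoutSheets(pages: Table[], mode: SheetMode): Sheet[] {
  if (mode === "stacked") {
    // One sheet: rows from every page appended in page order.
    const all: Table = [];
    for (const rows of pages) {
      all.push(...rows);
    }
    return [{ name: "Sheet1", rows: all }];
  }
  // One sheet per page, named by page number (naming is an assumption).
  return pages.map((rows, i) => ({ name: `Page ${i + 1}`, rows }));
}
```

Each resulting `Sheet` holds a plain array of rows, which is the shape most spreadsheet libraries accept when building a worksheet.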