OCR defects - a running list of files with OCR issues

Is it helpful to keep all the files with OCR errors in one post? Wasn’t sure where you’d prefer us to report them, so let me know if it should be posted elsewhere :slightly_smiling_face:

e.g. EFTA02292933, comparing the Epstein Exposed page with the DOJ file: the OCR didn’t pick up the Sent date and added a ‘t’ after the sender name (it’s a redacted portion). I’ve seen other OCRs ‘misread’ the redaction bars too.

Great idea keeping a running thread for OCR issues. This is exactly the right place to report them.

When you spot something, just drop the EFTA ID and a quick description of what’s wrong. I can correct entries directly in the database or flag pages for re-extraction if the OCR is garbled enough to warrant it.

For EFTA02292933, the phantom ‘t’ from the redaction bar is a known issue with heavily redacted pages. The OCR engine reads the edge of the black bar as a character. I’ll flag it for manual correction.
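If anyone wants to scan extracted text for this pattern before reporting, here is a rough sketch of a check. It is only a heuristic, not how the site detects these: the function name, the list of header prefixes, and the "lone trailing letter" rule are all assumptions for illustration, and it will produce false positives (a real middle initial at the end of a line, for example).

```python
import re

# Hypothetical heuristic: a single letter sitting alone at the end of an
# email header line may be the edge of a redaction bar read as a character.
_TRAILING_LONE_CHAR = re.compile(r"\s([A-Za-z])\s*$")

# Assumed header labels; adjust to whatever the extracted text actually uses.
HEADER_PREFIXES = ("From:", "To:", "Cc:", "Sent:")


def suspect_header_lines(ocr_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs for header lines ending in a lone letter."""
    hits = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        if line.startswith(HEADER_PREFIXES) and _TRAILING_LONE_CHAR.search(line):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    sample = "From: Example Sender t\nTo: someone\nSubject: a"
    for number, line in suspect_header_lines(sample):
        print(f"line {number}: {line!r}")
```

Anything this flags still needs a human look against the original scan before it goes on the list.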

I’m making this a wiki post so anyone can edit and add to the list directly.


Quick update on this one: EFTA02292933 is in the database, stored under the ID efta-02292933 (DS10 documents use a shorter ID format than the DS9 efta-efta prefix). I’ve flagged it for OCR correction.
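For anyone looking entries up by hand, the mapping for this document can be sketched as below. This is a guess at the rule based on the single example above (EFTA02292933 becomes efta-02292933), not the site's actual code. The function name is made up, and the DS9 format is left out because its exact shape isn't spelled out here.

```python
import re

# Matches IDs like "EFTA02292933": the literal prefix followed by digits.
_EFTA_RE = re.compile(r"^EFTA(\d+)$", re.IGNORECASE)


def ds10_db_id(efta_id: str) -> str:
    """Convert an EFTA ID to the shorter DS10-style database ID (assumed rule)."""
    match = _EFTA_RE.match(efta_id.strip())
    if match is None:
        raise ValueError(f"not an EFTA ID: {efta_id!r}")
    # Keep the digits as-is, including the leading zero.
    return f"efta-{match.group(1)}"


if __name__ == "__main__":
    print(ds10_db_id("EFTA02292933"))
```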
