Hi @Omerta_Archetype, great questions, and no need to apologize for the edits. We appreciate you taking the time (especially with 7 kids around). Let me address each one.
1. Consolidated Lists of Names, Emails, Companies, Dates
We actually have a lot of this data already in the system, just not surfaced as cleanly as it could be:
- 1,500+ person profiles at /persons — searchable, filterable by category (associate, victim, legal, financial, etc.), with cross-references to every document, flight, and email they appear in.
- 10,000+ emails at /emails — browsable with sender/recipient filtering and full-text search.
- 1,700+ flight logs at /flights — searchable by passenger name, date, origin/destination.
- Extracted entities — our AI pipeline has already extracted names, dates, and organizations from EFTA documents and stored them in a structured database.
What we’re missing is a dedicated “entity explorer” page that aggregates all of this into one view — e.g., “show me every company mentioned across all documents” or “show me all dates associated with this person.” That’s a great feature request and something we’re planning to build out. In the meantime, the co-occurrence tool shows which names appear together most frequently across documents.
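For anyone curious how co-occurrence counting works in principle, here is a minimal Python sketch. It is an illustration only, not the site's actual implementation: the function name `co_occurrence`, the input shape (a dict of document id to names), and the placeholder names are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(doc_names):
    """Count how often each pair of names appears in the same document.

    doc_names: hypothetical mapping of document id -> list of names found in it.
    """
    counts = Counter()
    for names in doc_names.values():
        # De-duplicate and sort so (A, B) and (B, A) are counted as one pair
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts


# Placeholder data, not real records
docs = {
    "doc-1": ["Person A", "Person B", "Person C"],
    "doc-2": ["Person A", "Person B"],
    "doc-3": ["Person B", "Person C", "Person B"],
}
pairs = co_occurrence(docs)
for pair, count in pairs.most_common():
    print(pair, count)
```

Sorting each pair before counting is what keeps the two orderings of the same two names from being tallied separately.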
2. Feedback Option While Reviewing
This already exists! Our Document Review System lets any signed-in user:
- Rate documents (significant, routine, junk, unreadable, needs expert review)
- Tag categories (Financial, Legal, Correspondence, Flight Related, Victim Testimony, etc.)
- Spot persons — search our database and tag people you recognize in a document
- Write key findings and notes — flag important discoveries in free-text
After 3+ reviewers reach consensus on a document, it gets marked accordingly. Significant discoveries with key findings automatically post to the community feed.
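As a rough illustration of the consensus step, the Python sketch below marks a document once one rating has at least 3 votes and a strict majority. That exact rule is an assumption made for the example (the real threshold logic may differ), and `consensus_rating` is a hypothetical name.

```python
from collections import Counter

MIN_AGREEING = 3  # assumed reading of "3+ reviewers reach consensus"


def consensus_rating(ratings, min_agreeing=MIN_AGREEING):
    """Return the agreed rating, or None if there is no consensus yet.

    Assumption: consensus means one rating has at least `min_agreeing`
    votes and more than half of all votes cast.
    """
    if not ratings:
        return None
    rating, votes = Counter(ratings).most_common(1)[0]
    if votes >= min_agreeing and votes > len(ratings) / 2:
        return rating
    return None


print(consensus_rating(["significant", "significant", "junk"]))
print(consensus_rating(["significant", "significant", "significant", "junk"]))
```

Under this rule the first call returns `None` (only two reviewers agree) and the second returns `significant`.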
What we could improve: right now, if you spot a new name that isn’t already in our person database, you’d need to use the submission form to suggest adding them. We’re working on making that inline — so you can flag new entities (names, companies, email addresses) directly during review without leaving the page.
3. Auto-Tagging Documents
Yes — we already do this to a significant degree:
- Our system automatically scans every document’s title, summary, and OCR text for known person names and creates links. That’s how we have 1.5 million+ document-person connections in the database.
- For EFTA documents specifically, we’ve run AI entity extraction that pulls out names, dates, organizations, and other structured data.
- Every document also has a full-text search index (tsvector) built from its title, summary, and OCR text.
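To make the name-scanning step concrete, here is a minimal Python sketch of matching known names against a document's title, summary, and OCR text. The field names, the `find_person_links` function, and the sample data are assumptions for illustration, and this is not the production pipeline.

```python
import re


def find_person_links(doc, known_persons):
    """Return sorted ids of known persons whose name appears in the document.

    doc: hypothetical dict with optional "title", "summary", "ocr_text" keys.
    known_persons: hypothetical mapping of person id -> full name.
    """
    text = " ".join(doc.get(field) or "" for field in ("title", "summary", "ocr_text"))
    matches = set()
    for person_id, name in known_persons.items():
        # Allow any whitespace between name parts, since OCR often
        # breaks a name across lines.
        body = r"\s+".join(re.escape(part) for part in name.split())
        # Lookarounds stop a name from matching inside a longer word.
        pattern = r"(?<!\w)" + body + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            matches.add(person_id)
    return sorted(matches)


# Placeholder data, not real records
doc = {
    "title": "Flight manifest",
    "summary": None,
    "ocr_text": "Passengers: Jane\nDoe and others",
}
known = {1: "Jane Doe", 2: "John Roe"}
print(find_person_links(doc, known))
```

Exact string matching like this will miss misspellings and OCR errors, which is one reason a human review pass on top of the automatic links is still valuable.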
The goal you’re describing — where reviewers primarily audit and correct what the system already found rather than starting from scratch — is exactly the direction we’re heading. Think of it as: AI does the first pass, humans verify and add what was missed.
4. The “=” Symbol in DOJ Documents
Great question. The “=” symbol you’re seeing is an artifact of base64 encoding, and you’re right that these aren’t simply scanned originals.
Here’s what happened: many of the EFTA documents are actually emails with attachments. The DOJ’s process for releasing them was, unfortunately, destructive:
- They printed the raw email files — including the full MIME/binary encoding of any PDF attachments — onto paper
- Then scanned those printouts back as PDFs with OCR
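When email attachments are transmitted, they’re encoded as base64 (a way to represent binary data as text characters). The “=” at the end of a base64 block is padding: the encoding works in groups of 4 characters, so a final group that comes up short is filled out with one or two “=” characters. It’s a requirement of the encoding scheme, not meaningful content.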
So when you see pages of seemingly random characters ending in “=”, you’re looking at what was originally a PDF attachment (like an invitation, a receipt, a letter) that got printed in its raw encoded form instead of being rendered properly.
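If you want to see the padding for yourself, this short Python example uses the standard library's `base64` module. The helper `padding_length` is just a name made up for the demonstration.

```python
import base64


def padding_length(data):
    """Return how many '=' padding characters base64 adds for these bytes."""
    encoded = base64.b64encode(data).decode("ascii")
    return len(encoded) - len(encoded.rstrip("="))


# Base64 turns every 3 bytes into 4 characters; leftover bytes get padded.
for data in (b"abc", b"abcd", b"abcde"):
    encoded = base64.b64encode(data).decode("ascii")
    print(len(data), "bytes ->", encoded, "padding:", padding_length(data))
```

Three bytes encode to `YWJj` with no padding, four bytes to `YWJjZA==`, and five bytes to `YWJjZGU=`, so the number of “=” characters depends only on the length of the data.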
The good news: researchers have successfully decoded some of these back into the original PDFs. We track this on our Document Integrity page, and our system can detect and flag documents with recoverable encoded attachments.
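As a sketch of how such recovery can work in principle (not a description of our detection code or of any researcher's tooling), the Python below looks for the characters `JVBERi0`, which is how a PDF's `%PDF-` header plus version digit begins once base64-encoded, then decodes what follows. The function name and regular expression are assumptions for the example, and real OCR output is usually much noisier than this handles.

```python
import base64
import binascii
import re

PDF_MARKER = "JVBERi0"  # base64 of "%PDF-" followed by a version digit
# Marker, then base64 characters and whitespace, then optional padding
ATTACHMENT = re.compile(PDF_MARKER + r"[A-Za-z0-9+/\s]*={0,2}")


def recover_pdf_attachments(ocr_text):
    """Return the bytes of each base64-encoded PDF found in the text."""
    recovered = []
    for match in ATTACHMENT.finditer(ocr_text):
        # Remove line breaks and old padding, then re-pad to a multiple of 4
        encoded = re.sub(r"\s+", "", match.group(0)).rstrip("=")
        if len(encoded) % 4 == 1:
            encoded = encoded[:-1]  # a lone trailing character is never valid
        encoded += "=" * (-len(encoded) % 4)
        try:
            data = base64.b64decode(encoded, validate=True)
        except binascii.Error:
            continue
        if data.startswith(b"%PDF-"):
            recovered.append(data)
    return recovered


# Simulated printout: a tiny fake "PDF" encoded the way an email would carry it
sample = b"%PDF-1.4\nhello world\n%%EOF\n"
page = (
    "Content-Transfer-Encoding: base64\n\n"
    + base64.encodebytes(sample).decode("ascii")
    + "\n--boundary--"
)
print(recover_pdf_attachments(page) == [sample])
```

In practice the hard part is OCR accuracy: a single misread character (for example `l` versus `1`) corrupts the decoded bytes, so recovered files usually need manual correction.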
Regarding document EFTA01772178 specifically — we do have it indexed in our database (Dataset 10). It has 4 properly applied redactions and does not appear to have the base64 attachment issue.
If you’re having trouble finding a specific DOJ document, you can search for it on our site at /documents or try the AI Research Assistant, which can search across 1.6M+ documents by keyword, name, or topic.
Thanks again for the thoughtful suggestions. The vision you’re describing — certified lists, community-driven tagging, and making everything searchable — is exactly what we’re building toward. Keep the ideas coming!