Requesting help: trying to find / consolidate lists of important emails, names, companies, dates - to compare and add to

Good evening everyone,

I am hoping to find the most updated lists of known important emails, names, companies, dates, (along with any other relevant information), certify those lists onsite here, and I would also like to request a feedback option while reviewing - in order to pass new emails, names, dates, etc. up the chain, collectively adding to each respective list.

Further, is it possible to have the system auto-register name(s), date(s) or date range(s), and email address(es), so that reviewers simply audit, validate, and/or correct? By tagging each document with all relevant names, dates, etc., the hope is to make it easier for people to search by any means we / they can come up with.

Lastly, I have a question concerning the documents found on the DOJ website: what does the symbol “=” mean? Document for reference: EFTA01772178. Apologies, I am having trouble finding the DOJ document.

I assume this means that the documents on the DOJ website are not originals, but screen copies that did not capture correctly. Please let me know, and thank you for your time.

Apologies for the edits - I have twins and am watching a total of 7 kids at the moment. :hugs:

Hi @Omerta_Archetype, great questions — and no need to apologize for the edits, we appreciate you taking the time (especially with 7 kids around :hugs:). Let me address each one.


1. Consolidated Lists of Names, Emails, Companies, Dates

We actually have a lot of this data already in the system, just not surfaced as cleanly as it could be:

  • 1,500+ person profiles at /persons — searchable, filterable by category (associate, victim, legal, financial, etc.), with cross-references to every document, flight, and email they appear in.
  • 10,000+ emails at /emails — browsable with sender/recipient filtering and full-text search.
  • 1,700+ flight logs at /flights — searchable by passenger name, date, origin/destination.
  • Extracted entities — our AI pipeline has already extracted names, dates, and organizations from EFTA documents and stored them in a structured database.

What we’re missing is a dedicated “entity explorer” page that aggregates all of this into one view — e.g., “show me every company mentioned across all documents” or “show me all dates associated with this person.” That’s a great feature request and something we’re planning to build out. In the meantime, the co-occurrence tool shows which names appear together most frequently across documents.
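For anyone curious how a co-occurrence count like this can work, here is a minimal sketch. It is not the site's actual implementation; the function name `co_occurrence` and the sample document IDs and names are made up for illustration. It simply counts how many documents each pair of names shares.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(doc_persons):
    """Count how often each pair of names appears in the same document."""
    counts = Counter()
    for names in doc_persons.values():
        # sorted(set(...)) removes duplicates and makes ("A", "B") and
        # ("B", "A") map to the same key
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts

# Hypothetical sample data: document ID -> names tagged in that document
sample = {
    "DOC-1": ["Person A", "Person B", "Person C"],
    "DOC-2": ["Person A", "Person B"],
    "DOC-3": ["Person B", "Person C"],
}
pairs = co_occurrence(sample)
for pair, n in pairs.most_common():
    print(pair, n)
```

Pairs with the highest counts are the clusters most worth a closer look.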

2. Feedback Option While Reviewing

This already exists! Our Document Review System lets any signed-in user:

  • Rate documents (significant, routine, junk, unreadable, needs expert review)
  • Tag categories (Financial, Legal, Correspondence, Flight Related, Victim Testimony, etc.)
  • Spot persons — search our database and tag people you recognize in a document
  • Write key findings and notes — flag important discoveries in free-text

After 3+ reviewers reach consensus on a document, it gets marked accordingly. Significant discoveries with key findings automatically post to the community feed.
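As a rough illustration of the consensus step, here is a sketch under one loud assumption: that "consensus" means a strict majority among at least three ratings. The real rule on the site may differ, and `consensus_rating` is a hypothetical name.

```python
from collections import Counter

MIN_REVIEWS = 3  # from the post: "3+ reviewers"

def consensus_rating(ratings, min_reviews=MIN_REVIEWS):
    """Return the agreed rating, or None if there is no consensus yet.

    Assumption: consensus = a strict majority of at least `min_reviews`
    ratings. This is an illustrative rule, not the site's confirmed logic.
    """
    if len(ratings) < min_reviews:
        return None
    rating, votes = Counter(ratings).most_common(1)[0]
    # Require more than half of all reviewers to agree
    return rating if votes > len(ratings) / 2 else None

print(consensus_rating(["significant", "significant", "routine"]))
print(consensus_rating(["significant", "routine"]))
```

The first call has three reviews with two in agreement, so it yields a rating; the second has too few reviews and yields None.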

What we could improve: right now, if you spot a new name that isn’t already in our person database, you’d need to use the submission form to suggest adding them. We’re working on making that inline — so you can flag new entities (names, companies, email addresses) directly during review without leaving the page.

3. Auto-Tagging Documents

Yes — we already do this to a significant degree:

  • Our system automatically scans every document’s title, summary, and OCR text for known person names and creates links. That’s how we have 1.5 million+ document-person connections in the database.
  • For EFTA documents specifically, we’ve run AI entity extraction that pulls out names, dates, organizations, and other structured data.
  • Every document also has a full-text search index (tsvector) built from its title, summary, and OCR text.

The goal you’re describing — where reviewers primarily audit and correct what the system already found rather than starting from scratch — is exactly the direction we’re heading. Think of it as: AI does the first pass, humans verify and add what was missed.
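To make the "AI does the first pass" idea concrete, here is a simplified sketch of scanning text for known person names. It assumes a plain list of names and whole-phrase, case-insensitive matching; the function name `find_known_persons` and the sample names are hypothetical, and the real pipeline is surely more involved.

```python
import re

def find_known_persons(text, known_names):
    """Return the known names that occur in `text` as whole phrases."""
    found = []
    for name in known_names:
        # \b word boundaries keep "Ann Lee" from matching inside "Joann Leeds"
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found

# Hypothetical document text and name list
text = "Meeting with Jane Doe and john smith on Friday. Cc: Joann Leeds."
known = ["Jane Doe", "John Smith", "Ann Lee"]
print(find_known_persons(text, known))
```

A reviewer would then only need to confirm or reject each suggested link, and add anyone the scan missed.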

4. The “=” Symbol in DOJ Documents

Great question. The symbol you’re seeing is an artifact of base64 encoding — and you’re right that these aren’t simply scanned originals.

Here’s what happened: many of the EFTA documents are actually emails with attachments. The DOJ’s process for releasing them was, unfortunately, destructive:

  1. They printed the raw email files — including the full MIME/binary encoding of any PDF attachments — onto paper
  2. Then scanned those printouts back as PDFs with OCR

When email attachments are transmitted, they’re encoded as base64 (a way to represent binary data as text characters). The “=” at the end of a base64 block is padding: it’s a mathematical requirement of the encoding scheme, not meaningful content.

So when you see pages of seemingly random characters ending in “=”, you’re looking at what was originally a PDF attachment (like an invitation, a receipt, a letter) that got printed in its raw encoded form instead of being rendered properly.
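If you want to see the padding for yourself, Python's standard `base64` module shows it in a few lines. Base64 turns every 3 bytes into 4 characters, so when the input length is not a multiple of 3 the last group is filled out with one or two “=” signs. The helper name `encode` is just for this example.

```python
import base64

def encode(raw):
    """Base64-encode bytes and return the result as text."""
    return base64.b64encode(raw).decode("ascii")

for raw in (b"abc", b"abcd", b"abcde"):
    print(len(raw), "bytes ->", encode(raw))
# 3 bytes -> YWJj       (no padding needed)
# 4 bytes -> YWJjZA==   (two padding characters)
# 5 bytes -> YWJjZGU=   (one padding character)
```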

The good news: researchers have successfully decoded some of these back into the original PDFs. We track this on our Document Integrity page, and our system can detect and flag documents with recoverable encoded attachments.
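For a sense of what that decoding involves, here is a minimal sketch, not the method the researchers or our system actually use. It assumes the OCR text is clean base64 apart from line breaks and spaces; real scans usually contain misread characters, which makes actual recovery much harder. The function name `try_recover_pdf` is hypothetical.

```python
import base64
import re

def try_recover_pdf(ocr_text):
    """Attempt to decode OCR'd base64 text back into PDF bytes.

    Returns the bytes if they start with the PDF signature, else None.
    """
    # Keep only base64 alphabet characters; drops line breaks, spaces, and "="
    cleaned = re.sub(r"[^A-Za-z0-9+/]", "", ocr_text)
    # Re-add the padding the decoder expects
    cleaned += "=" * (-len(cleaned) % 4)
    try:
        data = base64.b64decode(cleaned)
    except ValueError:
        # binascii.Error is a ValueError subclass; raised on impossible lengths
        return None
    # Every PDF file begins with the bytes "%PDF"
    return data if data.startswith(b"%PDF") else None

# Simulate a printed attachment: base64 text wrapped across short lines
encoded = base64.b64encode(b"%PDF-1.4 test").decode("ascii")
wrapped = "\n".join(encoded[i:i + 8] for i in range(0, len(encoded), 8))
print(try_recover_pdf(wrapped))
```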

Regarding document EFTA01772178 specifically — we do have it indexed in our database (Dataset 10). It has 4 properly applied redactions and does not appear to have the base64 attachment issue.

If you’re having trouble finding a specific DOJ document, you can search for it on our site at /documents or try the AI Research Assistant which can search across 1.6M+ documents by keyword, name, or topic.


Thanks again for the thoughtful suggestions. The vision you’re describing — certified lists, community-driven tagging, and making everything searchable — is exactly what we’re building toward. Keep the ideas coming!


Mornin sir,

  1. #1 & #3: Understood, will await that update.
  2. #2 - Wilco on submission form :+1:
  3. #4 - Do you foresee this as an issue in the future? As we do not have access to the original documents, I imagine 98% of these 3.5M docs are meant to throw us off and take us down the wrong path. It might help to create a round table for people to discuss and suggest ideas on how to think through and streamline how we handle and combat the foreseeable issues.

Morning. The round table idea has merit — you’re right that the volume can be overwhelming and having a structured way to separate signal from noise would help.

A few tools that might help right now:

  1. The co-occurrence tool — shows which people appear together across documents. If you’re trying to map networks, this is probably the fastest way to identify clusters worth investigating.

  2. The cross-reference tool — lets you compare two people side by side and see shared documents, flights, and connections.

  3. OCR full-text search — the search bar at the top searches across 1.5M+ document texts. If you’re looking for specific names, companies, or account numbers, it’ll pull matching documents directly.

As for the 3.5M docs being designed to throw people off — I wouldn’t frame it that way. Most of the documents are mundane (invoices, scheduling emails, property records), but that mundane stuff is exactly where the patterns hide. The researchers having the most success here (like @Redpanda’s wire payment tracking) are the ones methodically working through categories of financial records rather than hunting for smoking guns.

For the round table concept — the forum itself is probably the right place for that. If you want to start a “Research Coordination” thread where people can claim document ranges or topics they’re working on to avoid duplication, that would be genuinely useful.

Wasabi!

Is the ‘research round table’ thread in the right spot?

Also, I think I should’ve made two - name the current one as Philosophical (for the what ifs), and the other Practical (for real time implications and suggestions of researching).

Lastly, I am trying to standardize my reviews by putting all relevant dates, names, emails, companies, etc. in every comment box. I hope that’s helpful, but please let me know if I am wasting time :melting_face:
