[SYS // OPEN SOURCE]
doc2json
Define your schema. Extract your data. Write to your database.
pip install doc2json
- You define the schema: not a fixed set of fields, but the exact structure your business needs. doc2json helps you build it interactively.
- Extracts structured data from PDFs, Word documents, scanned images, and HTML. Handles tables, nested fields, and multi-page layouts.
- Every extraction is validated against your schema and grounded back to the source document. You can see exactly where each field came from.
- You choose where the AI runs: locally with Ollama, in your enterprise cloud, or via public APIs. Your data, your rules.
- Outputs clean JSON ready for pipelines, databases, or warehouses. Built-in connectors to Snowflake, BigQuery, PostgreSQL, and more.
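
To make "validated against your schema and grounded back to the source" concrete, here is a minimal standard-library sketch of the idea. It does not use doc2json's actual API, which is not shown on this page; the schema layout, the `validate` helper, the `grounding` map, and the sample invoice are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical schema: field name -> expected Python type.
# Nested dicts describe nested objects; a one-element list describes an array.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "total": float,
    "vendor": {"name": str, "country": str},
    "line_items": [{"description": str, "amount": float}],
}


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    errors = []
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        for key, sub_schema in schema.items():
            if key not in value:
                errors.append(f"{path}.{key}: missing")
            else:
                errors += validate(value[key], sub_schema, f"{path}.{key}")
    elif isinstance(schema, list):
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        for i, item in enumerate(value):
            errors += validate(item, schema[0], f"{path}[{i}]")
    elif not isinstance(value, schema):
        errors.append(f"{path}: expected {schema.__name__}")
    return errors


# Hypothetical extraction result for one document.
extracted = {
    "invoice_number": "INV-1042",
    "total": 1250.0,
    "vendor": {"name": "Acme GmbH", "country": "DE"},
    "line_items": [
        {"description": "Consulting", "amount": 1000.0},
        {"description": "Travel", "amount": 250.0},
    ],
}

# Hypothetical grounding map: field path -> where the value was found.
grounding = {
    "$.invoice_number": {"page": 1, "snippet": "Invoice No. INV-1042"},
    "$.total": {"page": 2, "snippet": "Total due: 1,250.00 EUR"},
}

errors = validate(extracted, INVOICE_SCHEMA)
if not errors:
    # Emit the record together with its grounding, ready for a pipeline.
    print(json.dumps({"data": extracted, "grounding": grounding}, indent=2))
else:
    print("\n".join(errors))
```

Keeping the grounding map separate from the data means the record written to a database or warehouse stays clean, while each field path can still be traced to a page and text snippet during review.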
Need Help Getting Started?
We built doc2json and use it in production every day.
If you need help with schema design, deployment, or connecting it to your infrastructure, just reach out.