Skip to content
Go back

Build a Bank Statement Extractor with JSON Array Transactions

This tutorial shows how to model bank statements in PdfParse using a hybrid approach: store statement-level metadata as structured columns while capturing individual transactions in a JSON array field. This design gives you the best of both worlds - strongly-typed statement data for reliable queries plus flexible transaction storage that adapts to varying formats across different banks.

Demo placeholder (swap in your own clip):
<video controls width="100%" poster="/blog/placeholders/bank-demo-poster.png">
<source src="/blog/bank-statement.webm" type="video/webm" />
<source src="/blog/bank-statement.mp4" type="video/mp4" />
</video>

1) Schema: parent + array field

Copyable schema (table)

bank_statement

keytypeprompt
account_holderstring”Name of the account holder”
account_numberstring”Masked account number if present”
statement_period_startdate”Start date of the statement period”
statement_period_enddate”End date of the statement period”
opening_balancenumber”Opening balance for the period”
closing_balancenumber”Closing balance for the period”
purchases_jsonarray”Array of transactions: objects with date, merchant, amount, etc.”

Copyable schema (JSON)

{
  "bank_statement": {
    "fields": [
      { "key": "account_holder", "type": "string", "prompt": "Name of the account holder" },
      { "key": "account_number", "type": "string", "prompt": "Masked account number if present" },
      { "key": "statement_period_start", "type": "date", "prompt": "Start date of the statement period" },
      { "key": "statement_period_end", "type": "date", "prompt": "End date of the statement period" },
      { "key": "opening_balance", "type": "number", "prompt": "Opening balance for the period" },
      { "key": "closing_balance", "type": "number", "prompt": "Closing balance for the period" },
      {
        "key": "purchases_json",
        "type": "array",
        "prompt": "Array of transactions; each object has date, merchant, amount, category, memo, payment_method"
      }
    ]
  }
}

2) Why use an array for transactions

Flexibility meets simplicity: Bank statements vary widely across institutions - some include merchant categories, others show memo fields, and payment methods differ by account type. A JSON array handles this variability gracefully:

For many use cases - especially reconciliation, accounting imports, or basic analytics - JSON arrays provide the perfect balance between structure and flexibility.

3) Array configuration tips

Prompt engineering matters: To get clean, consistent transaction data, follow these best practices when configuring your purchases_json array field:

Example prompt for purchases_json:

"Array of all transactions on this statement. Each transaction should be an object with: date (YYYY-MM-DD), merchant (string), amount (number, positive for credits/negative for debits), category (optional string), memo (optional string), payment_method (optional: 'debit', 'check', 'ATM', etc.). Omit summary rows and headers."

4) Querying JSON arrays in SQLite

SQLite’s JSON support is powerful: Once your data is extracted, you can query transaction arrays using SQLite’s built-in JSON functions. Here are practical examples:

Extract all transactions for a specific statement:

SELECT
  json_extract(p.value, '$.date') AS transaction_date,
  json_extract(p.value, '$.merchant') AS merchant,
  json_extract(p.value, '$.amount') AS amount,
  json_extract(p.value, '$.category') AS category
FROM bank_statement bs,
     json_each(bs.purchases_json) AS p
WHERE bs.id = ?;

Find all transactions above a threshold:

SELECT
  bs.account_holder,
  json_extract(p.value, '$.date') AS date,
  json_extract(p.value, '$.merchant') AS merchant,
  CAST(json_extract(p.value, '$.amount') AS REAL) AS amount
FROM bank_statement bs,
     json_each(bs.purchases_json) AS p
WHERE CAST(json_extract(p.value, '$.amount') AS REAL) > 1000.00;

Aggregate spending by category across all statements:

SELECT
  json_extract(p.value, '$.category') AS category,
  COUNT(*) AS transaction_count,
  SUM(CAST(json_extract(p.value, '$.amount') AS REAL)) AS total_spent
FROM bank_statement bs,
     json_each(bs.purchases_json) AS p
WHERE json_extract(p.value, '$.category') IS NOT NULL
GROUP BY category
ORDER BY total_spent DESC;

Reconcile statement balances:

SELECT
  bs.statement_period_start,
  bs.statement_period_end,
  bs.opening_balance,
  bs.closing_balance,
  bs.opening_balance + SUM(CAST(json_extract(p.value, '$.amount') AS REAL)) AS calculated_balance,
  bs.closing_balance - (bs.opening_balance + SUM(CAST(json_extract(p.value, '$.amount') AS REAL))) AS discrepancy
FROM bank_statement bs,
     json_each(bs.purchases_json) AS p
GROUP BY bs.id;

5) Field mapping checklist

Ensure complete coverage: When setting up your bank statement schema, make sure you capture these essential fields:

Statement-level metadata (scalar columns):

Transaction array (purchases_json): Each transaction object should include:

Type validation tip: Always specify that amounts should be numeric values. Banks often format amounts with currency symbols ($, €) or thousand separators (1,234.56) which can confuse extraction if not explicitly handled.

6) Testing and validation

Start small, iterate fast: Before processing hundreds of statements, validate your schema with a diverse sample set:

Initial validation:

  1. Process 3-5 sample statements from different banks/formats to test schema flexibility
  2. Export as SQLite and inspect the raw data - verify that purchases_json contains valid JSON arrays
  3. Check data types: Run SELECT typeof(amount) FROM (SELECT json_extract(value, '$.amount') as amount FROM bank_statement, json_each(purchases_json)) to confirm amounts are stored as numbers, not strings
  4. Review edge cases: Look for statements with refunds, fees, interest payments, or international transactions

Balance reconciliation:

-- This query should show zero or near-zero discrepancy
SELECT
  account_holder,
  opening_balance,
  closing_balance,
  (SELECT SUM(CAST(json_extract(p.value, '$.amount') AS REAL))
   FROM json_each(purchases_json) p) AS total_transactions,
  closing_balance - (opening_balance +
    (SELECT SUM(CAST(json_extract(p.value, '$.amount') AS REAL))
     FROM json_each(purchases_json) p)) AS discrepancy
FROM bank_statement;

If discrepancies appear, common causes include:

Category validation: If merchant categories come out inconsistent or noisy, refine your prompt:

7) When to use arrays vs child tables

Choose the right tool for your use case:

Use JSON arrays when:

Migrate to child tables when:

See our guide on building invoice extractors with child tables for the relational approach. You can always start with arrays and migrate to child tables later - PdfParse makes both patterns easy to implement.

8) Download your SQLite database

Once you’ve processed your bank statements and reviewed the extractions, you can download the complete SQLite database containing all your statement data including the JSON transaction arrays.

How to download:

  1. Navigate to Dashboard → Projects → [Your Bank Statement Project]
  2. Click on the Downloads tab in the navigation
  3. Click the “Download Database” button
  4. Save the .sqlite file to your local machine

Download demo placeholder (swap in your own clip): <video controls width="100%" poster="/blog/placeholders/download-demo-poster.png"> <source src="/blog/sqlite-download.webm" type="video/webm" /> <source src="/blog/sqlite-download.mp4" type="video/mp4" /> </video>

Breadcrumb navigation: Dashboard → Projects → [Your Project] → Downloads → Download Database

The downloaded SQLite file contains your bank_statement table with the purchases_json array field fully populated. You can immediately:

See our detailed guide on integrating SQLite databases into your workflow for Python examples, API patterns, and advanced querying techniques.


Get started with bank statement extraction

Ready to automate your bank statement processing? Create an account and set up this template in minutes. Our product walkthrough shows the full extraction workflow in the launch article.


Continue learning about structured document extraction:

Need help with edge cases? Contact us at support@pdfparse.net if you’re handling:

We’ll help you refine your schema and prompts to handle even the trickiest statement formats.



Previous Post
Introducing PdfParse: Transform Documents into Structured Databases
Next Post
Build an Invoice Extractor with Child Tables and Foreign Keys