
Build a Resume Database: Extract and Query Hundreds of Resumes as SQLite

Managing hundreds of resume PDFs is a recruiting nightmare. You can’t search across them, can’t filter by skills or experience, and comparing candidates means opening dozens of files manually. This tutorial shows you how to transform that pile of unstructured PDFs into a queryable SQLite database using PdfParse - enabling powerful candidate searches, automated screening, and seamless ATS integration.


1) Schema: parent + repeating children

A well-designed resume schema captures the hierarchical nature of candidate data: one candidates parent table, plus repeating child tables for work experience, education, and skills.

This normalized structure lets you run powerful queries like “find all candidates with 5+ years of Python experience who worked at FAANG companies.”

Copyable schema (tables)

candidates

| key | type | prompt |
| --- | --- | --- |
| full_name | string | "Candidate's full name as shown on resume" |
| email | string | "Primary email address" |
| phone | string | "Phone number with country code if present" |
| location | string | "City and state/country (e.g., 'San Francisco, CA')" |
| linkedin_url | string | "LinkedIn profile URL if present" |
| portfolio_url | string | "Personal website or portfolio URL if present" |
| summary | string | "Professional summary or objective statement" |
| years_experience | number | "Total years of professional work experience" |
| current_title | string | "Most recent or current job title" |

work_experience (repeating child of candidates)

| key | type | prompt |
| --- | --- | --- |
| candidate_id | number | "Link to parent candidate id" |
| company | string | "Company or organization name" |
| title | string | "Job title or role" |
| start_date | date | "Start date in YYYY-MM format" |
| end_date | string | "End date in YYYY-MM format, or 'Present' if current" |
| description | string | "Job responsibilities, achievements, and key projects" |
| technologies_used | string | "Technologies, tools, or skills used in this role" |

education (repeating child of candidates)

| key | type | prompt |
| --- | --- | --- |
| candidate_id | number | "Link to parent candidate id" |
| institution | string | "School, university, or educational institution name" |
| degree | string | "Degree type: BS, BA, MS, MBA, PhD, etc." |
| field_of_study | string | "Major or field of study" |
| graduation_year | number | "Year of graduation (YYYY format)" |
| gpa | string | "GPA if clearly stated (e.g., '3.8/4.0')" |

skills (repeating child of candidates)

| key | type | prompt |
| --- | --- | --- |
| candidate_id | number | "Link to parent candidate id" |
| skill_name | string | "Name of the skill, technology, or tool" |
| proficiency_level | string | "Proficiency if stated: beginner, intermediate, advanced, expert" |

Copyable schema (JSON)

{
  "candidates": {
    "fields": [
      { "key": "full_name", "type": "string", "prompt": "Candidate's full name as shown on resume" },
      { "key": "email", "type": "string", "prompt": "Primary email address" },
      { "key": "phone", "type": "string", "prompt": "Phone number with country code if present" },
      { "key": "location", "type": "string", "prompt": "City and state/country (e.g., 'San Francisco, CA')" },
      { "key": "linkedin_url", "type": "string", "prompt": "LinkedIn profile URL if present" },
      { "key": "portfolio_url", "type": "string", "prompt": "Personal website or portfolio URL if present" },
      { "key": "summary", "type": "string", "prompt": "Professional summary or objective statement" },
      { "key": "years_experience", "type": "number", "prompt": "Total years of professional work experience" },
      { "key": "current_title", "type": "string", "prompt": "Most recent or current job title" }
    ]
  },
  "work_experience": {
    "parent": "candidates",
    "repeating": true,
    "fields": [
      { "key": "candidate_id", "type": "number", "prompt": "Link to parent candidate id" },
      { "key": "company", "type": "string", "prompt": "Company or organization name" },
      { "key": "title", "type": "string", "prompt": "Job title or role" },
      { "key": "start_date", "type": "date", "prompt": "Start date in YYYY-MM format" },
      { "key": "end_date", "type": "string", "prompt": "End date in YYYY-MM format, or 'Present' if current" },
      { "key": "description", "type": "string", "prompt": "Job responsibilities, achievements, and key projects" },
      { "key": "technologies_used", "type": "string", "prompt": "Technologies, tools, or skills used in this role" }
    ]
  },
  "education": {
    "parent": "candidates",
    "repeating": true,
    "fields": [
      { "key": "candidate_id", "type": "number", "prompt": "Link to parent candidate id" },
      { "key": "institution", "type": "string", "prompt": "School, university, or educational institution name" },
      { "key": "degree", "type": "string", "prompt": "Degree type: BS, BA, MS, MBA, PhD, etc." },
      { "key": "field_of_study", "type": "string", "prompt": "Major or field of study" },
      { "key": "graduation_year", "type": "number", "prompt": "Year of graduation (YYYY format)" },
      { "key": "gpa", "type": "string", "prompt": "GPA if clearly stated (e.g., '3.8/4.0')" }
    ]
  },
  "skills": {
    "parent": "candidates",
    "repeating": true,
    "fields": [
      { "key": "candidate_id", "type": "number", "prompt": "Link to parent candidate id" },
      { "key": "skill_name", "type": "string", "prompt": "Name of the skill, technology, or tool" },
      { "key": "proficiency_level", "type": "string", "prompt": "Proficiency if stated: beginner, intermediate, advanced, expert" }
    ]
  }
}

2) Why this schema structure

Normalized relational design unlocks powerful recruiting workflows: the parent candidates table holds one row per person, while work_experience, education, and skills hold any number of child rows linked by candidate_id.

This structure mirrors how recruiting databases and Applicant Tracking Systems (ATS) store candidate data - making integration seamless.
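
That FAANG example maps directly onto this schema. A minimal Python sketch (the database filename and the exact company list are illustrative assumptions):

import sqlite3

# One pass over the normalized tables: skill match + experience threshold
# + employer filter, all joined through the candidate_id foreign keys.
QUERY = """
SELECT DISTINCT c.full_name, c.email, c.years_experience
FROM candidates c
JOIN skills s ON s.candidate_id = c.id
JOIN work_experience we ON we.candidate_id = c.id
WHERE s.skill_name = 'Python'
  AND c.years_experience >= 5
  AND we.company IN ('Meta', 'Apple', 'Amazon', 'Netflix', 'Google')
ORDER BY c.years_experience DESC
"""

conn = sqlite3.connect('resumes.sqlite')  # assumed filename, used throughout this post
for name, email, years in conn.execute(QUERY):
    print(f'{name} <{email}> - {years} years')
conn.close()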

3) Schema prompts and extraction tips

Prompt engineering is what keeps the extracted data clean. A few example prompt refinements:

skills.skill_name: "Extract individual skills from the skills section. Return each skill separately (e.g., 'Python', 'React', 'AWS'). Do not include proficiency levels in the skill name."

work_experience.description: "Complete description of responsibilities and achievements for this role. Include quantifiable results if present (e.g., 'increased sales by 30%')."

years_experience: "Calculate total years of professional work experience by summing all work history. Count partial years (e.g., 6 months = 0.5 years). Exclude internships unless specifically labeled as professional roles."

4) Powerful SQL queries for recruiting

Once you’ve processed resumes and downloaded your SQLite database, these queries unlock candidate intelligence:

Find candidates with specific skill combinations:

SELECT DISTINCT
  c.full_name,
  c.email,
  c.current_title,
  c.years_experience,
  GROUP_CONCAT(DISTINCT s.skill_name) as matching_skills
FROM candidates c
JOIN skills s ON s.candidate_id = c.id
WHERE s.skill_name IN ('Python', 'React', 'AWS', 'Docker')
GROUP BY c.id
HAVING COUNT(DISTINCT s.skill_name) >= 3  -- Must have at least 3 of the 4 skills
ORDER BY c.years_experience DESC;

Search by experience level and location:

SELECT
  full_name,
  email,
  location,
  current_title,
  years_experience
FROM candidates
WHERE years_experience BETWEEN 5 AND 10
  AND location LIKE '%San Francisco%'
ORDER BY years_experience DESC;

Find candidates from target companies:

SELECT
  c.full_name,
  c.email,
  c.phone,
  we.company,
  we.title,
  we.start_date,
  we.end_date
FROM candidates c
JOIN work_experience we ON we.candidate_id = c.id
WHERE we.company IN ('Google', 'Meta', 'Amazon', 'Microsoft', 'Apple')
ORDER BY c.full_name, we.start_date DESC;

Full-text search across job descriptions:

SELECT
  c.full_name,
  c.email,
  we.company,
  we.title,
  we.description
FROM candidates c
JOIN work_experience we ON we.candidate_id = c.id
WHERE we.description LIKE '%machine learning%'
   OR we.description LIKE '%artificial intelligence%'
   OR we.technologies_used LIKE '%TensorFlow%'
   OR we.technologies_used LIKE '%PyTorch%';

Filter by education criteria:

SELECT
  c.full_name,
  c.email,
  e.degree,
  e.field_of_study,
  e.institution,
  e.graduation_year
FROM candidates c
JOIN education e ON e.candidate_id = c.id
WHERE e.degree IN ('MS', 'PhD')
  AND e.field_of_study LIKE '%Computer Science%'
ORDER BY e.graduation_year DESC;

Most in-demand skills in your candidate pool:

SELECT
  skill_name,
  COUNT(*) as candidate_count,
  ROUND(COUNT(*) * 100.0 / (SELECT COUNT(DISTINCT candidate_id) FROM skills), 2) as percentage
FROM skills
GROUP BY skill_name
ORDER BY candidate_count DESC
LIMIT 20;

Candidates with recent experience at senior levels:

SELECT DISTINCT
  c.full_name,
  c.email,
  c.years_experience,
  we.title,
  we.company,
  we.end_date
FROM candidates c
JOIN work_experience we ON we.candidate_id = c.id
WHERE (we.title LIKE '%Senior%'
   OR we.title LIKE '%Lead%'
   OR we.title LIKE '%Principal%'
   OR we.title LIKE '%Staff%')
  AND (we.end_date = 'Present' OR we.end_date >= '2023-01')
ORDER BY c.years_experience DESC;

5) Download and integrate your resume database

From extraction to automation in three steps:

  1. Process your resumes: Upload PDFs to PdfParse, process with your schema, review any extraction errors
  2. Download SQLite database: Navigate to Dashboard → [Project Name] → Downloads and click “Download Database”
  3. Integrate anywhere: Use the portable .sqlite file in Python scripts, Excel, BI tools, or custom applications


Full path: Dashboard → Projects → [Your Project] → Downloads → Download Database

The downloaded SQLite file contains all your tables with proper foreign key relationships - ready to query immediately.
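
Before wiring it into anything, a quick sanity check is cheap. A sketch assuming the file was saved as resumes.sqlite:

import sqlite3

# List every table in the download and its row count.
conn = sqlite3.connect('resumes.sqlite')
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
for table in tables:
    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f'{table}: {count} rows')
conn.close()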

6) Python integration for automated screening

Build a candidate matching engine:

import sqlite3
import pandas as pd
from typing import List, Optional

def find_matching_candidates(
    db_path: str,
    required_skills: List[str],
    min_experience: int = 0,
    target_companies: Optional[List[str]] = None,
    location: Optional[str] = None
) -> pd.DataFrame:
    """
    Find candidates matching specific criteria.

    Args:
        db_path: Path to the downloaded SQLite database
        required_skills: List of required skills (e.g., ['Python', 'React'])
        min_experience: Minimum years of experience
        target_companies: Optional list of companies to filter by
        location: Optional location filter (partial match)

    Returns:
        DataFrame of matching candidates with their details
    """
    conn = sqlite3.connect(db_path)

    # Build dynamic query based on filters
    query = """
    SELECT DISTINCT
        c.id,
        c.full_name,
        c.email,
        c.phone,
        c.location,
        c.current_title,
        c.years_experience,
        c.linkedin_url,
        GROUP_CONCAT(DISTINCT s.skill_name) as all_skills,
        GROUP_CONCAT(DISTINCT we.company) as companies
    FROM candidates c
    LEFT JOIN skills s ON s.candidate_id = c.id
    LEFT JOIN work_experience we ON we.candidate_id = c.id
    WHERE c.years_experience >= ?
    """

    params = [min_experience]

    if location:
        query += " AND c.location LIKE ?"
        params.append(f"%{location}%")

    query += " GROUP BY c.id"

    # Execute query
    df = pd.read_sql_query(query, conn, params=params)

    # Filter by required skills
    if required_skills:
        def has_required_skills(skills_str):
            if pd.isna(skills_str):
                return False
            skills = skills_str.lower().split(',')
            return all(
                any(req.lower() in skill for skill in skills)
                for req in required_skills
            )

        df = df[df['all_skills'].apply(has_required_skills)]

    # Filter by target companies
    if target_companies:
        def worked_at_target(companies_str):
            if pd.isna(companies_str):
                return False
            return any(
                target.lower() in companies_str.lower()
                for target in target_companies
            )

        df = df[df['companies'].apply(worked_at_target)]

    conn.close()
    return df.sort_values('years_experience', ascending=False)

# Example usage
candidates = find_matching_candidates(
    db_path='resumes.sqlite',
    required_skills=['Python', 'AWS', 'Docker'],
    min_experience=5,
    target_companies=['Google', 'Amazon', 'Microsoft'],
    location='San Francisco'
)

print(f"Found {len(candidates)} matching candidates:")
print(candidates[['full_name', 'email', 'years_experience', 'current_title']])

Export for ATS import:

import sqlite3
import csv

def export_candidates_to_csv(db_path: str, output_file: str):
    """Export all candidates to CSV for ATS import."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Correlated subqueries avoid row fan-out from joining two child tables
    # at once, and sidestep SQLite's rule that GROUP_CONCAT(DISTINCT ...)
    # cannot take a custom separator.
    cursor.execute("""
        SELECT
            c.full_name,
            c.email,
            c.phone,
            c.location,
            c.current_title,
            c.years_experience,
            c.linkedin_url,
            (SELECT GROUP_CONCAT(s.skill_name, '; ')
             FROM skills s
             WHERE s.candidate_id = c.id) as skills,
            (SELECT GROUP_CONCAT(we.company || ' (' || we.title || ')', '; ')
             FROM work_experience we
             WHERE we.candidate_id = c.id) as experience
        FROM candidates c
        ORDER BY c.full_name
    """)

    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([
            'Full Name', 'Email', 'Phone', 'Location',
            'Current Title', 'Years Experience', 'LinkedIn',
            'Skills', 'Work History'
        ])
        writer.writerows(cursor.fetchall())

    conn.close()
    print(f"Exported to {output_file}")

# Export all candidates
export_candidates_to_csv('resumes.sqlite', 'candidates_import.csv')

AI-powered candidate screening:

from typing import Any, Dict

from anthropic import Anthropic
import sqlite3

def ai_screen_candidate(
    db_path: str,
    candidate_id: int,
    job_description: str
) -> Dict[str, Any]:
    """
    Use Claude to evaluate candidate fit for a specific role.

    Args:
        db_path: Path to SQLite database
        candidate_id: ID of candidate to evaluate
        job_description: Full job description text

    Returns:
        Dictionary with score, reasoning, and strengths/concerns
    """
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Get the complete candidate profile. Correlated subqueries keep the
    # three child tables from fanning out into a cross product and avoid
    # SQLite's one-argument limit on GROUP_CONCAT(DISTINCT ...).
    cursor.execute("""
        SELECT
            c.*,
            (SELECT GROUP_CONCAT(we.company || ' - ' || we.title ||
                     ' (' || we.start_date || ' to ' || we.end_date || '): ' ||
                     COALESCE(we.description, 'No description'), '\n\n')
             FROM work_experience we
             WHERE we.candidate_id = c.id) as experience,
            (SELECT GROUP_CONCAT(s.skill_name, ', ')
             FROM skills s
             WHERE s.candidate_id = c.id) as skills,
            (SELECT GROUP_CONCAT(e.degree || ' in ' || e.field_of_study ||
                     ' from ' || e.institution || ' (' || e.graduation_year || ')', '\n')
             FROM education e
             WHERE e.candidate_id = c.id) as education
        FROM candidates c
        WHERE c.id = ?
    """, [candidate_id])

    candidate = cursor.fetchone()
    conn.close()

    if not candidate:
        return {"error": "Candidate not found"}

    # Build candidate summary
    candidate_summary = f"""
CANDIDATE PROFILE

Name: {candidate[1]}
Location: {candidate[4]}
Current Title: {candidate[9]}
Years of Experience: {candidate[8]}
Email: {candidate[2]}
LinkedIn: {candidate[5]}

SUMMARY
{candidate[7] or 'No summary provided'}

WORK EXPERIENCE
{candidate[10] or 'No work experience listed'}

SKILLS
{candidate[11] or 'No skills listed'}

EDUCATION
{candidate[12] or 'No education listed'}
"""

    client = Anthropic(api_key="your-api-key")

    message = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""You are an expert technical recruiter. Evaluate this candidate's fit for the following role.

JOB DESCRIPTION:
{job_description}

{candidate_summary}

Provide your evaluation in this format:
1. OVERALL FIT SCORE: [1-10]
2. KEY STRENGTHS: [bullet points]
3. POTENTIAL CONCERNS: [bullet points]
4. RECOMMENDATION: [Hire/Interview/Pass with reasoning]
5. SUGGESTED INTERVIEW QUESTIONS: [3-5 questions to validate fit]
"""
        }]
    )

    return {
        "candidate_id": candidate_id,
        "candidate_name": candidate[1],
        "evaluation": message.content[0].text
    }

# Screen a candidate
job_desc = """
Senior Backend Engineer - Python/AWS
We're seeking a senior backend engineer with 5+ years of experience building
scalable APIs using Python, AWS, and modern database technologies...
"""

result = ai_screen_candidate('resumes.sqlite', candidate_id=42, job_description=job_desc)
print(result['evaluation'])

7) Excel and BI tool integration

Connect Excel to your resume database (a scripted pandas alternative follows these steps):

  1. Download and install the SQLite ODBC Driver
  2. In Excel: Data → Get Data → From Database → From ODBC
  3. Select your downloaded .sqlite file
  4. Choose tables to import (candidates, skills, work_experience, education)
  5. Use Power Query to create relationships between tables
  6. Build pivot tables, charts, and dashboards
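
If you would rather script the Excel hand-off than go through ODBC, a short pandas sketch can dump each table to its own worksheet (assumes pandas and openpyxl are installed):

import sqlite3
import pandas as pd  # assumption: pandas + openpyxl available

conn = sqlite3.connect('resumes.sqlite')
with pd.ExcelWriter('candidates.xlsx') as writer:  # openpyxl writes .xlsx
    for table in ('candidates', 'work_experience', 'education', 'skills'):
        df = pd.read_sql_query(f'SELECT * FROM {table}', conn)
        df.to_excel(writer, sheet_name=table, index=False)
conn.close()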

Power BI integration:

  1. Open Power BI Desktop
  2. Get Data → More → Database → SQLite
  3. Browse to your downloaded database file
  4. Select all tables and load
  5. Power BI auto-detects foreign key relationships
  6. Create visualizations:
    • Skills distribution (bar chart)
    • Candidates by location (map)
    • Experience level breakdown (pie chart)
    • Top companies represented (tree map)

Google Sheets integration (requires Google Apps Script):

// Google Apps Script to query SQLite via web service
function importCandidates() {
  // First, upload your SQLite to a web-accessible endpoint
  // or use a service like Datasette to expose an API

  const response = UrlFetchApp.fetch('https://your-datasette-instance.com/resumes/candidates.json?_shape=array');
  const candidates = JSON.parse(response.getContentText());

  const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();

  // Clear existing data
  sheet.clear();

  // Write headers
  sheet.getRange(1, 1, 1, 6).setValues([[
    'Name', 'Email', 'Location', 'Title', 'Experience', 'Skills'
  ]]);

  // Write candidate data
  const rows = candidates.map(c => [
    c.full_name,
    c.email,
    c.location,
    c.current_title,
    c.years_experience,
    c.skills
  ]);

  sheet.getRange(2, 1, rows.length, 6).setValues(rows);
}

8) Building a resume search API

Simple Express/TypeScript API for internal recruiting tools:

import express from 'express';
import Database from 'better-sqlite3';
import cors from 'cors';

const app = express();
const db = new Database('resumes.sqlite', { readonly: true });

app.use(cors());
app.use(express.json());

// Search candidates by multiple criteria
app.get('/api/candidates/search', (req, res) => {
  const {
    skills,
    minExperience,
    maxExperience,
    location,
    company,
    degree
  } = req.query;

  let query = `
    SELECT DISTINCT c.*
    FROM candidates c
    WHERE 1=1
  `;
  const params: any[] = [];

  if (minExperience) {
    query += ' AND c.years_experience >= ?';
    params.push(Number(minExperience));
  }

  if (maxExperience) {
    query += ' AND c.years_experience <= ?';
    params.push(Number(maxExperience));
  }

  if (location) {
    query += ' AND c.location LIKE ?';
    params.push(`%${location}%`);
  }

  let candidates = db.prepare(query).all(...params) as any[];

  // Filter by skills (requires join)
  if (skills) {
    const skillList = (skills as string).split(',').map(s => s.trim().toLowerCase());

    candidates = candidates.filter(c => {
      const candidateSkills = db.prepare(
        'SELECT skill_name FROM skills WHERE candidate_id = ?'
      ).all(c.id);

      const candidateSkillNames = candidateSkills.map(
        (s: any) => s.skill_name.toLowerCase()
      );

      return skillList.every(requiredSkill =>
        candidateSkillNames.some(candidateSkill =>
          candidateSkill.includes(requiredSkill)
        )
      );
    });
  }

  // Filter by company experience
  if (company) {
    candidates = candidates.filter(c => {
      const experience = db.prepare(
        'SELECT company FROM work_experience WHERE candidate_id = ?'
      ).all(c.id);

      return experience.some((exp: any) =>
        exp.company.toLowerCase().includes((company as string).toLowerCase())
      );
    });
  }

  // Filter by education
  if (degree) {
    candidates = candidates.filter(c => {
      const education = db.prepare(
        'SELECT degree FROM education WHERE candidate_id = ?'
      ).all(c.id);

      return education.some((edu: any) =>
        edu.degree.toLowerCase() === (degree as string).toLowerCase()
      );
    });
  }

  res.json({
    count: candidates.length,
    candidates: candidates
  });
});

// Get full candidate profile
app.get('/api/candidates/:id', (req, res) => {
  const candidate = db.prepare('SELECT * FROM candidates WHERE id = ?').get(req.params.id);

  if (!candidate) {
    return res.status(404).json({ error: 'Candidate not found' });
  }

  const workExperience = db.prepare(
    'SELECT * FROM work_experience WHERE candidate_id = ? ORDER BY start_date DESC'
  ).all(req.params.id);

  const education = db.prepare(
    'SELECT * FROM education WHERE candidate_id = ? ORDER BY graduation_year DESC'
  ).all(req.params.id);

  const skills = db.prepare(
    'SELECT * FROM skills WHERE candidate_id = ?'
  ).all(req.params.id);

  res.json({
    ...candidate,
    work_experience: workExperience,
    education: education,
    skills: skills
  });
});

// Get skills analytics
app.get('/api/analytics/skills', (req, res) => {
  const skillStats = db.prepare(`
    SELECT
      skill_name,
      COUNT(*) as candidate_count,
      ROUND(COUNT(*) * 100.0 / (SELECT COUNT(DISTINCT candidate_id) FROM skills), 2) as percentage
    FROM skills
    GROUP BY skill_name
    ORDER BY candidate_count DESC
    LIMIT 50
  `).all();

  res.json(skillStats);
});

// Get company distribution
app.get('/api/analytics/companies', (req, res) => {
  const companyStats = db.prepare(`
    SELECT
      company,
      COUNT(DISTINCT candidate_id) as candidate_count
    FROM work_experience
    GROUP BY company
    ORDER BY candidate_count DESC
    LIMIT 30
  `).all();

  res.json(companyStats);
});

app.listen(3000, () => {
  console.log('Resume API running on http://localhost:3000');
});

Frontend search interface example (React):

import { useState } from 'react';

interface Candidate {
  id: number;
  full_name: string;
  email: string;
  location: string;
  current_title: string;
  years_experience: number;
}

export default function CandidateSearch() {
  const [results, setResults] = useState<Candidate[]>([]);
  const [filters, setFilters] = useState({
    skills: '',
    minExperience: '',
    location: '',
    company: ''
  });

  const searchCandidates = async () => {
    const queryParams = new URLSearchParams();

    if (filters.skills) queryParams.append('skills', filters.skills);
    if (filters.minExperience) queryParams.append('minExperience', filters.minExperience);
    if (filters.location) queryParams.append('location', filters.location);
    if (filters.company) queryParams.append('company', filters.company);

    const response = await fetch(`http://localhost:3000/api/candidates/search?${queryParams}`);
    const data = await response.json();

    setResults(data.candidates);
  };

  return (
    <div className="p-6">
      <h1 className="text-2xl font-bold mb-6">Resume Database Search</h1>

      <div className="grid grid-cols-2 gap-4 mb-6">
        <input
          type="text"
          placeholder="Skills (comma-separated)"
          value={filters.skills}
          onChange={e => setFilters({...filters, skills: e.target.value})}
          className="border p-2 rounded"
        />

        <input
          type="number"
          placeholder="Min years experience"
          value={filters.minExperience}
          onChange={e => setFilters({...filters, minExperience: e.target.value})}
          className="border p-2 rounded"
        />

        <input
          type="text"
          placeholder="Location"
          value={filters.location}
          onChange={e => setFilters({...filters, location: e.target.value})}
          className="border p-2 rounded"
        />

        <input
          type="text"
          placeholder="Company"
          value={filters.company}
          onChange={e => setFilters({...filters, company: e.target.value})}
          className="border p-2 rounded"
        />
      </div>

      <button
        onClick={searchCandidates}
        className="bg-blue-600 text-white px-6 py-2 rounded hover:bg-blue-700"
      >
        Search Candidates
      </button>

      <div className="mt-8">
        <h2 className="text-xl font-semibold mb-4">
          Results ({results.length} candidates)
        </h2>

        <div className="space-y-4">
          {results.map(candidate => (
            <div key={candidate.id} className="border p-4 rounded shadow-sm">
              <h3 className="font-bold">{candidate.full_name}</h3>
              <p className="text-gray-600">{candidate.current_title}</p>
              <div className="mt-2 text-sm text-gray-500">
                <p>{candidate.location} • {candidate.years_experience} years experience</p>
                <p>{candidate.email}</p>
              </div>
            </div>
          ))}
        </div>
      </div>
    </div>
  );
}

9) Advanced analytics and reporting

Skill gap analysis - identify missing skills in your talent pool:

-- Define your target skill set and compare against available candidates
WITH desired_skills(skill) AS (
  VALUES
    ('Kubernetes'),
    ('Docker'),
    ('AWS'),
    ('Terraform'),
    ('Python'),
    ('Go'),
    ('React'),
    ('PostgreSQL')
)
SELECT
  d.skill,
  COUNT(DISTINCT s.candidate_id) as candidates_with_skill,
  ROUND(COUNT(DISTINCT s.candidate_id) * 100.0 / (SELECT COUNT(*) FROM candidates), 2) as coverage_percentage,
  (SELECT COUNT(*) FROM candidates) - COUNT(DISTINCT s.candidate_id) as candidates_missing_skill
FROM desired_skills d
LEFT JOIN skills s ON LOWER(s.skill_name) = LOWER(d.skill)
GROUP BY d.skill
ORDER BY coverage_percentage DESC;

Experience distribution analysis:

SELECT
  CASE
    WHEN years_experience < 2 THEN '0-2 years (Junior)'
    WHEN years_experience < 5 THEN '2-5 years (Mid-level)'
    WHEN years_experience < 10 THEN '5-10 years (Senior)'
    ELSE '10+ years (Expert/Lead)'
  END as experience_level,
  COUNT(*) as candidate_count,
  ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM candidates), 2) as percentage
FROM candidates
GROUP BY experience_level
ORDER BY MIN(years_experience);

Top educational institutions:

SELECT
  e.institution,
  COUNT(DISTINCT e.candidate_id) as candidate_count,
  GROUP_CONCAT(DISTINCT e.degree) as degrees_offered,
  GROUP_CONCAT(DISTINCT e.field_of_study) as fields
FROM education e
GROUP BY e.institution
ORDER BY candidate_count DESC
LIMIT 20;

Career path analysis - common job progressions:

-- Find candidates who moved from Company A to Company B
SELECT
  c.full_name,
  c.email,
  we1.company as previous_company,
  we1.title as previous_title,
  we2.company as current_company,
  we2.title as current_title
FROM candidates c
JOIN work_experience we1 ON we1.candidate_id = c.id
JOIN work_experience we2 ON we2.candidate_id = c.id
WHERE we1.company = 'Google'
  AND we2.company = 'Amazon'
  AND we1.start_date < we2.start_date;

Diversity metrics - location distribution:

SELECT
  CASE
    WHEN location LIKE '%San Francisco%' OR location LIKE '%SF%' THEN 'San Francisco Bay Area'
    WHEN location LIKE '%New York%' OR location LIKE '%NYC%' THEN 'New York'
    WHEN location LIKE '%Seattle%' THEN 'Seattle'
    WHEN location LIKE '%Austin%' THEN 'Austin'
    WHEN location LIKE '%Remote%' THEN 'Remote'
    ELSE 'Other'
  END as location_group,
  COUNT(*) as candidate_count
FROM candidates
GROUP BY location_group
ORDER BY candidate_count DESC;

10) Best practices and troubleshooting

Extraction quality tips:

  1. Test with diverse formats: Canva templates, LaTeX academic CVs, Word documents, LinkedIn exports
  2. Handle edge cases:
    • Career gaps (unemployed periods)
    • Contractor/freelance work (many short-term positions)
    • International education (different degree naming conventions)
    • Bilingual resumes (choose primary language)
  3. Validate critical fields (see the validation sketch after this list):
    • Email regex validation: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
    • Phone number normalization: convert all to +1-XXX-XXX-XXXX format
    • Date consistency: enforce YYYY-MM format for all date fields
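
A minimal validation sketch covering those three checks (the email regex comes from the list above; the phone normalizer assumes US-style 10/11-digit numbers):

import re
import sqlite3
from typing import Optional

EMAIL_RE = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
DATE_RE = re.compile(r'^\d{4}-(0[1-9]|1[0-2])$')  # strict YYYY-MM

def normalize_phone(raw: str) -> Optional[str]:
    """Normalize to +1-XXX-XXX-XXXX; return None for anything else."""
    digits = re.sub(r'\D', '', raw or '')
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    if len(digits) != 10:
        return None
    return f'+1-{digits[:3]}-{digits[3:6]}-{digits[6:]}'

conn = sqlite3.connect('resumes.sqlite')
for cid, email, phone in conn.execute('SELECT id, email, phone FROM candidates'):
    if email and not EMAIL_RE.match(email):
        print(f'candidate {cid}: suspicious email {email!r}')
    if phone and normalize_phone(phone) is None:
        print(f'candidate {cid}: unparseable phone {phone!r}')
for cid, start in conn.execute('SELECT candidate_id, start_date FROM work_experience'):
    if start and not DATE_RE.match(start):
        print(f'candidate {cid}: non-YYYY-MM start_date {start!r}')
conn.close()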

Common extraction issues and fixes:

| Issue | Cause | Solution |
| --- | --- | --- |
| Skills extracted as paragraphs instead of individual entries | Vague prompt | Prompt: "Extract each skill separately. Return individual skills (e.g., 'Python', 'AWS'), not full sentences." |
| Inconsistent date formats | Resumes use various formats | Prompt: "Return dates in YYYY-MM format. For 'Jan 2020', return '2020-01'." |
| Missing work descriptions | Some resumes only list titles | Adjust expectations: make description optional; rely on technologies_used for details |
| Duplicate skills with slight variations | "JavaScript" vs "Javascript" vs "JS" | Post-processing: normalize skill names with a mapping table (see the sketch below) |
| Years of experience calculated incorrectly | Model misunderstands overlapping positions | Be explicit: "Sum all work periods, excluding overlaps and internships" |
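
For the duplicate-skill row above, a post-processing sketch with a hand-maintained alias map (the alias entries, and the assumption that skills rows carry their own id primary key, are illustrative):

import sqlite3

# Illustrative alias map; extend it as new variants show up in your pool.
SKILL_ALIASES = {
    'js': 'JavaScript',
    'javascript': 'JavaScript',
    'reactjs': 'React',
    'aws': 'AWS',
    'postgres': 'PostgreSQL',
}

conn = sqlite3.connect('resumes.sqlite')
rows = conn.execute('SELECT id, skill_name FROM skills').fetchall()  # fetch first, then update
for skill_id, name in rows:
    canonical = SKILL_ALIASES.get((name or '').strip().lower())
    if canonical and canonical != name:
        conn.execute('UPDATE skills SET skill_name = ? WHERE id = ?',
                     (canonical, skill_id))
conn.commit()
conn.close()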

11) Scaling to thousands of resumes

Batch processing workflow:

  1. Organize by job posting: Create separate PdfParse projects for each role or hiring campaign
  2. Process in batches: Upload 50-100 resumes at a time to monitor extraction quality
  3. Review and retry: Check error rates after each batch, refine prompts if >5% fail
  4. Download incrementally: Download SQLite after each batch, merge databases locally

Merging multiple SQLite databases:

import sqlite3

def merge_resume_databases(output_db: str, input_dbs: list):
    """Merge multiple resume databases into one."""
    conn_out = sqlite3.connect(output_db)
    cursor_out = conn_out.cursor()

    # Create tables in output database
    cursor_out.execute('''
        CREATE TABLE IF NOT EXISTS candidates (
            id INTEGER PRIMARY KEY,
            full_name TEXT,
            email TEXT,
            phone TEXT,
            location TEXT,
            linkedin_url TEXT,
            portfolio_url TEXT,
            summary TEXT,
            years_experience REAL,
            current_title TEXT
        )
    ''')

    cursor_out.execute('''
        CREATE TABLE IF NOT EXISTS work_experience (
            id INTEGER PRIMARY KEY,
            candidate_id INTEGER REFERENCES candidates(id),
            company TEXT,
            title TEXT,
            start_date TEXT,
            end_date TEXT,
            description TEXT,
            technologies_used TEXT
        )
    ''')

    # education and skills follow the same pattern: create the table,
    # then copy its rows with candidate_id shifted by the offset

    candidate_id_offset = 0

    for db_file in input_dbs:
        conn_in = sqlite3.connect(db_file)
        cursor_in = conn_in.cursor()

        # Copy candidates with ID offset
        cursor_in.execute('SELECT * FROM candidates')
        for row in cursor_in.fetchall():
            new_row = (row[0] + candidate_id_offset,) + row[1:]
            cursor_out.execute('''
                INSERT INTO candidates VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ''', new_row)

        # Copy work_experience with updated foreign keys
        cursor_in.execute('SELECT * FROM work_experience')
        for row in cursor_in.fetchall():
            new_row = (None, row[1] + candidate_id_offset) + row[2:]
            cursor_out.execute('''
                INSERT INTO work_experience (id, candidate_id, company, title, start_date, end_date, description, technologies_used)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
            ''', new_row)

        # Update offset for next database
        cursor_in.execute('SELECT MAX(id) FROM candidates')
        max_id = cursor_in.fetchone()[0] or 0
        candidate_id_offset += max_id

        conn_in.close()

    conn_out.commit()
    conn_out.close()
    print(f"Merged {len(input_dbs)} databases into {output_db}")

# Merge multiple job posting databases
merge_resume_databases(
    output_db='all_candidates_2025.sqlite',
    input_dbs=[
        'backend_engineer_q1.sqlite',
        'frontend_engineer_q1.sqlite',
        'data_scientist_q1.sqlite'
    ]
)

Performance optimization for large databases:

-- Create indexes for common query patterns
CREATE INDEX idx_candidates_experience ON candidates(years_experience);
CREATE INDEX idx_candidates_location ON candidates(location);
CREATE INDEX idx_skills_name ON skills(skill_name);
CREATE INDEX idx_skills_candidate ON skills(candidate_id);
CREATE INDEX idx_work_company ON work_experience(company);
CREATE INDEX idx_work_candidate ON work_experience(candidate_id);

-- Analyze database for query optimization
ANALYZE;
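
To confirm a query actually uses the new indexes, EXPLAIN QUERY PLAN is a cheap check (a sketch; the database filename matches the merge example above):

import sqlite3

conn = sqlite3.connect('all_candidates_2025.sqlite')
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM skills WHERE skill_name = 'Python'"
).fetchall()
for row in plan:
    print(row)  # the detail column should mention idx_skills_name
conn.close()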

12) Real-world use cases

Recruiting agency workflow:

  1. Create separate projects for each client company
  2. Process incoming resumes daily
  3. Download and merge weekly into master candidate database
  4. Run weekly reports: new skills trending, top universities, experience distribution
  5. Export matches to client ATS systems

Internal HR talent pool:

  1. Process all applicants from career site submissions
  2. Tag candidates by application date, source (LinkedIn, referral, career site)
  3. Quarterly reviews: re-screen previous applicants against new openings
  4. Build internal referral pipeline by searching for companies employees worked at

Portfolio company recruiting (VC/PE firms):

  1. Aggregate resumes across all portfolio companies
  2. Identify candidates who applied to multiple portfolio companies
  3. Cross-pollinate talent: suggest candidates to portfolio companies based on skill matches
  4. Track hiring metrics: time-to-hire, source quality, skill demand trends

Get started building your resume database

Ready to transform your recruiting workflow? Here’s how to get started:

  1. Create a PdfParse account - free tier includes 20 pages to test with sample resumes
  2. Set up your schema - copy the JSON schema from this article or customize to your needs
  3. Upload and process - drag and drop resume PDFs, click “Process Files”
  4. Review and refine - check extraction quality, adjust prompts if needed
  5. Download SQLite - get your portable database file ready for integration
  6. Build your workflow - integrate with Python, Excel, BI tools, or custom applications

Need help with your setup? Contact us at support@pdfparse.net.
