Managing hundreds of resume PDFs is a recruiting nightmare. You can’t search across them, can’t filter by skills or experience, and comparing candidates means opening dozens of files manually. This tutorial shows you how to transform that pile of unstructured PDFs into a queryable SQLite database using PdfParse - enabling powerful candidate searches, automated screening, and seamless ATS integration.
Demo placeholder (swap in your own clip):
<video controls width="100%" poster="/blog/placeholders/resume-demo-poster.png"><source src="/blog/resume-builder.webm" type="video/webm" /><source src="/blog/resume-builder.mp4" type="video/mp4" /></video>
1) Schema: parent + repeating children
A well-designed resume schema captures the hierarchical nature of candidate data:
- candidates (parent): Core candidate information - name, contact, summary, years of experience
- work_experience (child, repeating): Job history with companies, titles, dates, descriptions
- education (child, repeating): Degrees, institutions, graduation years
- skills (child, repeating): Technical skills, proficiencies, certifications
This normalized structure lets you run powerful queries like “find all candidates with 5+ years of Python experience who worked at FAANG companies.”
Copyable schema (tables)
candidates
| key | type | prompt |
|---|---|---|
| full_name | string | "Candidate's full name as shown on resume" |
| email | string | "Primary email address" |
| phone | string | "Phone number with country code if present" |
| location | string | "City and state/country (e.g., 'San Francisco, CA')" |
| linkedin_url | string | "LinkedIn profile URL if present" |
| portfolio_url | string | "Personal website or portfolio URL if present" |
| summary | string | "Professional summary or objective statement" |
| years_experience | number | "Total years of professional work experience" |
| current_title | string | "Most recent or current job title" |
work_experience (repeating child of candidates)
| key | type | prompt |
|---|---|---|
| candidate_id | number | "Link to parent candidate id" |
| company | string | "Company or organization name" |
| title | string | "Job title or role" |
| start_date | date | "Start date in YYYY-MM format" |
| end_date | string | "End date in YYYY-MM format, or 'Present' if current" |
| description | string | "Job responsibilities, achievements, and key projects" |
| technologies_used | string | "Technologies, tools, or skills used in this role" |
education (repeating child of candidates)
| key | type | prompt |
|---|---|---|
| candidate_id | number | "Link to parent candidate id" |
| institution | string | "School, university, or educational institution name" |
| degree | string | "Degree type: BS, BA, MS, MBA, PhD, etc." |
| field_of_study | string | "Major or field of study" |
| graduation_year | number | "Year of graduation (YYYY format)" |
| gpa | string | "GPA if clearly stated (e.g., '3.8/4.0')" |
skills (repeating child of candidates)
| key | type | prompt |
|---|---|---|
| candidate_id | number | "Link to parent candidate id" |
| skill_name | string | "Name of the skill, technology, or tool" |
| proficiency_level | string | "Proficiency if stated: beginner, intermediate, advanced, expert" |
Copyable schema (JSON)
{
"candidates": {
"fields": [
{ "key": "full_name", "type": "string", "prompt": "Candidate's full name as shown on resume" },
{ "key": "email", "type": "string", "prompt": "Primary email address" },
{ "key": "phone", "type": "string", "prompt": "Phone number with country code if present" },
{ "key": "location", "type": "string", "prompt": "City and state/country (e.g., 'San Francisco, CA')" },
{ "key": "linkedin_url", "type": "string", "prompt": "LinkedIn profile URL if present" },
{ "key": "portfolio_url", "type": "string", "prompt": "Personal website or portfolio URL if present" },
{ "key": "summary", "type": "string", "prompt": "Professional summary or objective statement" },
{ "key": "years_experience", "type": "number", "prompt": "Total years of professional work experience" },
{ "key": "current_title", "type": "string", "prompt": "Most recent or current job title" }
]
},
"work_experience": {
"parent": "candidates",
"repeating": true,
"fields": [
{ "key": "candidate_id", "type": "number", "prompt": "Link to parent candidate id" },
{ "key": "company", "type": "string", "prompt": "Company or organization name" },
{ "key": "title", "type": "string", "prompt": "Job title or role" },
{ "key": "start_date", "type": "date", "prompt": "Start date in YYYY-MM format" },
{ "key": "end_date", "type": "string", "prompt": "End date in YYYY-MM format, or 'Present' if current" },
{ "key": "description", "type": "string", "prompt": "Job responsibilities, achievements, and key projects" },
{ "key": "technologies_used", "type": "string", "prompt": "Technologies, tools, or skills used in this role" }
]
},
"education": {
"parent": "candidates",
"repeating": true,
"fields": [
{ "key": "candidate_id", "type": "number", "prompt": "Link to parent candidate id" },
{ "key": "institution", "type": "string", "prompt": "School, university, or educational institution name" },
{ "key": "degree", "type": "string", "prompt": "Degree type: BS, BA, MS, MBA, PhD, etc." },
{ "key": "field_of_study", "type": "string", "prompt": "Major or field of study" },
{ "key": "graduation_year", "type": "number", "prompt": "Year of graduation (YYYY format)" },
{ "key": "gpa", "type": "string", "prompt": "GPA if clearly stated (e.g., '3.8/4.0')" }
]
},
"skills": {
"parent": "candidates",
"repeating": true,
"fields": [
{ "key": "candidate_id", "type": "number", "prompt": "Link to parent candidate id" },
{ "key": "skill_name", "type": "string", "prompt": "Name of the skill, technology, or tool" },
{ "key": "proficiency_level", "type": "string", "prompt": "Proficiency if stated: beginner, intermediate, advanced, expert" }
]
}
}
2) Why this schema structure
Normalized relational design unlocks powerful recruiting workflows:
- Avoid duplication: Store candidate contact info once, not repeated across every job entry
- Enable aggregation: Query across all work experience to find patterns (e.g., “candidates who worked at startups”)
- Support joins: Combine skills + experience + education for sophisticated matching
- Future-proof: Add new tables (certifications, projects, languages) without breaking existing queries
- BI tool friendly: Most business intelligence and reporting tools expect normalized foreign key relationships
This structure mirrors how recruiting databases and ATS (Applicant Tracking Systems) store candidate data - making integration seamless.
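For reference, here is roughly what that structure looks like as SQLite DDL - a sketch built with Python's standard sqlite3 module; the exact types and constraints in the file PdfParse generates may differ, but the shape is the same:

import sqlite3

# Sketch of the normalized schema - column names match the schema above;
# the INTEGER PRIMARY KEY and REFERENCES details are assumptions.
conn = sqlite3.connect("resumes.sqlite")
conn.executescript("""
CREATE TABLE IF NOT EXISTS candidates (
    id INTEGER PRIMARY KEY,
    full_name TEXT, email TEXT, phone TEXT, location TEXT,
    linkedin_url TEXT, portfolio_url TEXT, summary TEXT,
    years_experience REAL, current_title TEXT
);
CREATE TABLE IF NOT EXISTS work_experience (
    id INTEGER PRIMARY KEY,
    candidate_id INTEGER REFERENCES candidates(id),
    company TEXT, title TEXT, start_date TEXT, end_date TEXT,
    description TEXT, technologies_used TEXT
);
CREATE TABLE IF NOT EXISTS education (
    id INTEGER PRIMARY KEY,
    candidate_id INTEGER REFERENCES candidates(id),
    institution TEXT, degree TEXT, field_of_study TEXT,
    graduation_year INTEGER, gpa TEXT
);
CREATE TABLE IF NOT EXISTS skills (
    id INTEGER PRIMARY KEY,
    candidate_id INTEGER REFERENCES candidates(id),
    skill_name TEXT, proficiency_level TEXT
);
""")
conn.close()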
3) Schema prompts and extraction tips
Prompt engineering for clean data:
- Mark work_experience, education, and skills as repeating child tables in PdfParse
- Add specific prompts per field - vague prompts like “get the title” produce inconsistent results
- For end_date, explicitly allow “Present” as a string value for current positions
- Extract technologies_used from job descriptions to supplement the skills table
- Handle date variations: some resumes use “Jan 2020”, others “2020-01”, others “January 2020” - prompt for “YYYY-MM format”
Example prompt refinements:
skills.skill_name: "Extract individual skills from the skills section. Return each skill separately (e.g., 'Python', 'React', 'AWS'). Do not include proficiency levels in the skill name."
work_experience.description: "Complete description of responsibilities and achievements for this role. Include quantifiable results if present (e.g., 'increased sales by 30%')."
years_experience: "Calculate total years of professional work experience by summing all work history. Count partial years (e.g., 6 months = 0.5 years). Exclude internships unless specifically labeled as professional roles."
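Even with a “YYYY-MM format” prompt, a few dates may still come back as “Jan 2020” or “January 2020”. A small post-processing pass can normalize the stragglers before you rely on date comparisons - a sketch, where the accepted input formats and the leave-as-is fallback are assumptions:

import re
from datetime import datetime

def normalize_date(raw):
    """Normalize extracted dates like 'Jan 2020' or 'January 2020' to 'YYYY-MM'."""
    if raw is None:
        return raw
    raw = raw.strip()
    # Already normalized, or the allowed 'Present' sentinel
    if raw == "Present" or re.fullmatch(r"\d{4}-\d{2}", raw):
        return raw
    for fmt in ("%b %Y", "%B %Y", "%m/%Y"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m")
        except ValueError:
            continue
    return raw  # leave unrecognized values for manual review

print(normalize_date("Jan 2020"))      # -> 2020-01
print(normalize_date("January 2020"))  # -> 2020-01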
4) Powerful SQL queries for recruiting
Once you’ve processed resumes and downloaded your SQLite database, these queries unlock candidate intelligence:
Find candidates with specific skill combinations:
SELECT DISTINCT
c.full_name,
c.email,
c.current_title,
c.years_experience,
GROUP_CONCAT(DISTINCT s.skill_name) as matching_skills
FROM candidates c
JOIN skills s ON s.candidate_id = c.id
WHERE s.skill_name IN ('Python', 'React', 'AWS', 'Docker')
GROUP BY c.id
HAVING COUNT(DISTINCT s.skill_name) >= 3 -- Must have at least 3 of the 4 skills
ORDER BY c.years_experience DESC;
Search by experience level and location:
SELECT
full_name,
email,
location,
current_title,
years_experience
FROM candidates
WHERE years_experience BETWEEN 5 AND 10
AND location LIKE '%San Francisco%'
ORDER BY years_experience DESC;
Find candidates from target companies:
SELECT
c.full_name,
c.email,
c.phone,
we.company,
we.title,
we.start_date,
we.end_date
FROM candidates c
JOIN work_experience we ON we.candidate_id = c.id
WHERE we.company IN ('Google', 'Meta', 'Amazon', 'Microsoft', 'Apple')
ORDER BY c.full_name, we.start_date DESC;
Full-text search across job descriptions:
SELECT
c.full_name,
c.email,
we.company,
we.title,
we.description
FROM candidates c
JOIN work_experience we ON we.candidate_id = c.id
WHERE we.description LIKE '%machine learning%'
OR we.description LIKE '%artificial intelligence%'
OR we.technologies_used LIKE '%TensorFlow%'
OR we.technologies_used LIKE '%PyTorch%';
Filter by education criteria:
SELECT
c.full_name,
c.email,
e.degree,
e.field_of_study,
e.institution,
e.graduation_year
FROM candidates c
JOIN education e ON e.candidate_id = c.id
WHERE e.degree IN ('MS', 'PhD')
AND e.field_of_study LIKE '%Computer Science%'
ORDER BY e.graduation_year DESC;
Most in-demand skills in your candidate pool:
SELECT
skill_name,
COUNT(*) as candidate_count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(DISTINCT candidate_id) FROM skills), 2) as percentage
FROM skills
GROUP BY skill_name
ORDER BY candidate_count DESC
LIMIT 20;
Candidates with recent experience at senior levels:
SELECT DISTINCT
c.full_name,
c.email,
c.years_experience,
we.title,
we.company,
we.end_date
FROM candidates c
JOIN work_experience we ON we.candidate_id = c.id
WHERE (we.title LIKE '%Senior%'
OR we.title LIKE '%Lead%'
OR we.title LIKE '%Principal%'
OR we.title LIKE '%Staff%')
AND (we.end_date = 'Present' OR we.end_date >= '2023-01')
ORDER BY c.years_experience DESC;
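All of these queries run unchanged against the downloaded file using Python's built-in sqlite3 module - for example, the in-demand skills query from above:

import sqlite3

conn = sqlite3.connect("resumes.sqlite")  # path to your downloaded database
conn.row_factory = sqlite3.Row  # access columns by name

rows = conn.execute("""
    SELECT skill_name, COUNT(*) as candidate_count
    FROM skills
    GROUP BY skill_name
    ORDER BY candidate_count DESC
    LIMIT 20
""")
for row in rows:
    print(f"{row['skill_name']}: {row['candidate_count']} candidates")
conn.close()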
5) Download and integrate your resume database
From extraction to automation in three steps:
- Process your resumes: Upload PDFs to PdfParse, process with your schema, review any extraction errors
- Download SQLite database: Navigate to Dashboard → [Project Name] → Downloads and click “Download Database”
- Integrate anywhere: Use the portable .sqlite file in Python scripts, Excel, BI tools, or custom applications
Download demo placeholder (swap in your own clip):
<video controls width="100%" poster="/blog/placeholders/download-demo-poster.png"><source src="/blog/sqlite-download.webm" type="video/webm" /><source src="/blog/sqlite-download.mp4" type="video/mp4" /></video>
Breadcrumb navigation: Dashboard → Projects → [Your Project] → Downloads → Download Database
The downloaded SQLite file contains all your tables with proper foreign key relationships - ready to query immediately.
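A quick sanity check after downloading - list the tables and their row counts to confirm everything arrived:

import sqlite3

conn = sqlite3.connect("resumes.sqlite")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]
for table in tables:
    # table names come from sqlite_master, not user input,
    # so quoting them directly into the query is safe
    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f"{table}: {count} rows")
conn.close()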
6) Python integration for automated screening
Build a candidate matching engine:
import sqlite3
import pandas as pd
from typing import List, Optional
def find_matching_candidates(
db_path: str,
required_skills: List[str],
min_experience: int = 0,
target_companies: Optional[List[str]] = None,
location: Optional[str] = None
) -> pd.DataFrame:
"""
Find candidates matching specific criteria.
Args:
db_path: Path to the downloaded SQLite database
required_skills: List of required skills (e.g., ['Python', 'React'])
min_experience: Minimum years of experience
target_companies: Optional list of companies to filter by
location: Optional location filter (partial match)
Returns:
DataFrame of matching candidates with their details
"""
conn = sqlite3.connect(db_path)
# Build dynamic query based on filters
query = """
SELECT DISTINCT
c.id,
c.full_name,
c.email,
c.phone,
c.location,
c.current_title,
c.years_experience,
c.linkedin_url,
GROUP_CONCAT(DISTINCT s.skill_name) as all_skills,
GROUP_CONCAT(DISTINCT we.company) as companies
FROM candidates c
LEFT JOIN skills s ON s.candidate_id = c.id
LEFT JOIN work_experience we ON we.candidate_id = c.id
WHERE c.years_experience >= ?
"""
params = [min_experience]
if location:
query += " AND c.location LIKE ?"
params.append(f"%{location}%")
query += " GROUP BY c.id"
# Execute query
df = pd.read_sql_query(query, conn, params=params)
# Filter by required skills
if required_skills:
def has_required_skills(skills_str):
if pd.isna(skills_str):
return False
skills = skills_str.lower().split(',')
return all(
any(req.lower() in skill for skill in skills)
for req in required_skills
)
df = df[df['all_skills'].apply(has_required_skills)]
# Filter by target companies
if target_companies:
def worked_at_target(companies_str):
if pd.isna(companies_str):
return False
return any(
target.lower() in companies_str.lower()
for target in target_companies
)
df = df[df['companies'].apply(worked_at_target)]
conn.close()
return df.sort_values('years_experience', ascending=False)
# Example usage
candidates = find_matching_candidates(
db_path='resumes.sqlite',
required_skills=['Python', 'AWS', 'Docker'],
min_experience=5,
target_companies=['Google', 'Amazon', 'Microsoft'],
location='San Francisco'
)
print(f"Found {len(candidates)} matching candidates:")
print(candidates[['full_name', 'email', 'years_experience', 'current_title']])
Export for ATS import:
import sqlite3
import csv
def export_candidates_to_csv(db_path: str, output_file: str):
"""Export all candidates to CSV for ATS import."""
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT
c.full_name,
c.email,
c.phone,
c.location,
c.current_title,
c.years_experience,
c.linkedin_url,
-- SQLite's GROUP_CONCAT can't take a custom separator with DISTINCT,
-- so these fall back to the default comma separator
GROUP_CONCAT(DISTINCT s.skill_name) as skills,
GROUP_CONCAT(DISTINCT we.company || ' (' || we.title || ')') as experience
FROM candidates c
LEFT JOIN skills s ON s.candidate_id = c.id
LEFT JOIN work_experience we ON we.candidate_id = c.id
GROUP BY c.id
ORDER BY c.full_name
""")
with open(output_file, 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f)
writer.writerow([
'Full Name', 'Email', 'Phone', 'Location',
'Current Title', 'Years Experience', 'LinkedIn',
'Skills', 'Work History'
])
writer.writerows(cursor.fetchall())
conn.close()
print(f"Exported to {output_file}")
# Export all candidates
export_candidates_to_csv('resumes.sqlite', 'candidates_import.csv')
AI-powered candidate screening:
from typing import Any, Dict
from anthropic import Anthropic
import sqlite3
def ai_screen_candidate(
db_path: str,
candidate_id: int,
job_description: str
) -> Dict[str, Any]:
"""
Use Claude to evaluate candidate fit for a specific role.
Args:
db_path: Path to SQLite database
candidate_id: ID of candidate to evaluate
job_description: Full job description text
Returns:
Dictionary with score, reasoning, and strengths/concerns
"""
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# Get complete candidate profile
cursor.execute("""
SELECT
c.*,
-- DISTINCT aggregates in SQLite take a single argument (no custom separator)
GROUP_CONCAT(DISTINCT we.company || ' - ' || we.title ||
' (' || we.start_date || ' to ' || we.end_date || '): ' ||
COALESCE(we.description, 'No description')) as experience,
GROUP_CONCAT(DISTINCT s.skill_name) as skills,
GROUP_CONCAT(DISTINCT e.degree || ' in ' || e.field_of_study ||
' from ' || e.institution || ' (' || e.graduation_year || ')') as education
FROM candidates c
LEFT JOIN work_experience we ON we.candidate_id = c.id
LEFT JOIN skills s ON s.candidate_id = c.id
LEFT JOIN education e ON e.candidate_id = c.id
WHERE c.id = ?
GROUP BY c.id
""", [candidate_id])
candidate = cursor.fetchone()
conn.close()
if not candidate:
return {"error": "Candidate not found"}
# Build candidate summary
candidate_summary = f"""
CANDIDATE PROFILE
Name: {candidate[1]}
Location: {candidate[4]}
Current Title: {candidate[9]}
Years of Experience: {candidate[8]}
Email: {candidate[2]}
LinkedIn: {candidate[5]}
SUMMARY
{candidate[7] or 'No summary provided'}
WORK EXPERIENCE
{candidate[10] or 'No work experience listed'}
SKILLS
{candidate[11] or 'No skills listed'}
EDUCATION
{candidate[12] or 'No education listed'}
"""
client = Anthropic(api_key="your-api-key")
message = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2000,
messages=[{
"role": "user",
"content": f"""You are an expert technical recruiter. Evaluate this candidate's fit for the following role.
JOB DESCRIPTION:
{job_description}
{candidate_summary}
Provide your evaluation in this format:
1. OVERALL FIT SCORE: [1-10]
2. KEY STRENGTHS: [bullet points]
3. POTENTIAL CONCERNS: [bullet points]
4. RECOMMENDATION: [Hire/Interview/Pass with reasoning]
5. SUGGESTED INTERVIEW QUESTIONS: [3-5 questions to validate fit]
"""
}]
)
return {
"candidate_id": candidate_id,
"candidate_name": candidate[1],
"evaluation": message.content[0].text
}
# Screen a candidate
job_desc = """
Senior Backend Engineer - Python/AWS
We're seeking a senior backend engineer with 5+ years of experience building
scalable APIs using Python, AWS, and modern database technologies...
"""
result = ai_screen_candidate('resumes.sqlite', candidate_id=42, job_description=job_desc)
print(result['evaluation'])
7) Excel and BI tool integration
Connect Excel to your resume database:
- Download and install the SQLite ODBC Driver
- In Excel: Data → Get Data → From Database → From ODBC
- Select your downloaded .sqlite file
- Choose tables to import (candidates, skills, work_experience, education)
- Use Power Query to create relationships between tables
- Build pivot tables, charts, and dashboards
Power BI integration:
- Open Power BI Desktop
- Get Data → More → Database → SQLite
- Browse to your downloaded database file
- Select all tables and load
- Power BI auto-detects foreign key relationships
- Create visualizations:
- Skills distribution (bar chart)
- Candidates by location (map)
- Experience level breakdown (pie chart)
- Top companies represented (tree map)
Google Sheets integration (requires Google Apps Script):
// Google Apps Script to query SQLite via web service
function importCandidates() {
// First, upload your SQLite to a web-accessible endpoint
// or use a service like Datasette to expose an API
const response = UrlFetchApp.fetch('https://your-datasette-instance.com/resumes/candidates.json?_shape=array');
const candidates = JSON.parse(response.getContentText());
const sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
// Clear existing data
sheet.clear();
// Write headers
sheet.getRange(1, 1, 1, 6).setValues([[
'Name', 'Email', 'Location', 'Title', 'Experience', 'Skills'
]]);
// Write candidate data
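// NOTE: this assumes the endpoint exposes a "skills" column,
// e.g. via a SQL view that joins the skills table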
const rows = candidates.map(c => [
c.full_name,
c.email,
c.location,
c.current_title,
c.years_experience,
c.skills
]);
sheet.getRange(2, 1, rows.length, 6).setValues(rows);
}
8) Building a resume search API
Simple Express/TypeScript API for internal recruiting tools:
import express from 'express';
import Database from 'better-sqlite3';
import cors from 'cors';
const app = express();
const db = new Database('resumes.sqlite', { readonly: true });
app.use(cors());
app.use(express.json());
// Search candidates by multiple criteria
app.get('/api/candidates/search', (req, res) => {
const {
skills,
minExperience,
maxExperience,
location,
company,
degree
} = req.query;
let query = `
SELECT DISTINCT c.*
FROM candidates c
WHERE 1=1
`;
const params: any[] = [];
if (minExperience) {
query += ' AND c.years_experience >= ?';
params.push(Number(minExperience));
}
if (maxExperience) {
query += ' AND c.years_experience <= ?';
params.push(Number(maxExperience));
}
if (location) {
query += ' AND c.location LIKE ?';
params.push(`%${location}%`);
}
let candidates = db.prepare(query).all(...params) as any[];
// Filter by skills (requires join)
if (skills) {
const skillList = (skills as string).split(',').map(s => s.trim().toLowerCase());
candidates = candidates.filter(c => {
const candidateSkills = db.prepare(
'SELECT skill_name FROM skills WHERE candidate_id = ?'
).all(c.id);
const candidateSkillNames = candidateSkills.map(
(s: any) => s.skill_name.toLowerCase()
);
return skillList.every(requiredSkill =>
candidateSkillNames.some(candidateSkill =>
candidateSkill.includes(requiredSkill)
)
);
});
}
// Filter by company experience
if (company) {
candidates = candidates.filter(c => {
const experience = db.prepare(
'SELECT company FROM work_experience WHERE candidate_id = ?'
).all(c.id);
return experience.some((exp: any) =>
exp.company.toLowerCase().includes((company as string).toLowerCase())
);
});
}
// Filter by education
if (degree) {
candidates = candidates.filter(c => {
const education = db.prepare(
'SELECT degree FROM education WHERE candidate_id = ?'
).all(c.id);
return education.some((edu: any) =>
edu.degree.toLowerCase() === (degree as string).toLowerCase()
);
});
}
res.json({
count: candidates.length,
candidates: candidates
});
});
// Get full candidate profile
app.get('/api/candidates/:id', (req, res) => {
const candidate = db.prepare('SELECT * FROM candidates WHERE id = ?').get(req.params.id) as any;
if (!candidate) {
return res.status(404).json({ error: 'Candidate not found' });
}
const workExperience = db.prepare(
'SELECT * FROM work_experience WHERE candidate_id = ? ORDER BY start_date DESC'
).all(req.params.id);
const education = db.prepare(
'SELECT * FROM education WHERE candidate_id = ? ORDER BY graduation_year DESC'
).all(req.params.id);
const skills = db.prepare(
'SELECT * FROM skills WHERE candidate_id = ?'
).all(req.params.id);
res.json({
...candidate,
work_experience: workExperience,
education: education,
skills: skills
});
});
// Get skills analytics
app.get('/api/analytics/skills', (req, res) => {
const skillStats = db.prepare(`
SELECT
skill_name,
COUNT(*) as candidate_count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(DISTINCT candidate_id) FROM skills), 2) as percentage
FROM skills
GROUP BY skill_name
ORDER BY candidate_count DESC
LIMIT 50
`).all();
res.json(skillStats);
});
// Get company distribution
app.get('/api/analytics/companies', (req, res) => {
const companyStats = db.prepare(`
SELECT
company,
COUNT(DISTINCT candidate_id) as candidate_count
FROM work_experience
GROUP BY company
ORDER BY candidate_count DESC
LIMIT 30
`).all();
res.json(companyStats);
});
app.listen(3000, () => {
console.log('Resume API running on http://localhost:3000');
});
Frontend search interface example (React):
import { useState } from 'react';
interface Candidate {
id: number;
full_name: string;
email: string;
location: string;
current_title: string;
years_experience: number;
}
export default function CandidateSearch() {
const [results, setResults] = useState<Candidate[]>([]);
const [filters, setFilters] = useState({
skills: '',
minExperience: '',
location: '',
company: ''
});
const searchCandidates = async () => {
const queryParams = new URLSearchParams();
if (filters.skills) queryParams.append('skills', filters.skills);
if (filters.minExperience) queryParams.append('minExperience', filters.minExperience);
if (filters.location) queryParams.append('location', filters.location);
if (filters.company) queryParams.append('company', filters.company);
const response = await fetch(`http://localhost:3000/api/candidates/search?${queryParams}`);
const data = await response.json();
setResults(data.candidates);
};
return (
<div className="p-6">
<h1 className="text-2xl font-bold mb-6">Resume Database Search</h1>
<div className="grid grid-cols-2 gap-4 mb-6">
<input
type="text"
placeholder="Skills (comma-separated)"
value={filters.skills}
onChange={e => setFilters({...filters, skills: e.target.value})}
className="border p-2 rounded"
/>
<input
type="number"
placeholder="Min years experience"
value={filters.minExperience}
onChange={e => setFilters({...filters, minExperience: e.target.value})}
className="border p-2 rounded"
/>
<input
type="text"
placeholder="Location"
value={filters.location}
onChange={e => setFilters({...filters, location: e.target.value})}
className="border p-2 rounded"
/>
<input
type="text"
placeholder="Company"
value={filters.company}
onChange={e => setFilters({...filters, company: e.target.value})}
className="border p-2 rounded"
/>
</div>
<button
onClick={searchCandidates}
className="bg-blue-600 text-white px-6 py-2 rounded hover:bg-blue-700"
>
Search Candidates
</button>
<div className="mt-8">
<h2 className="text-xl font-semibold mb-4">
Results ({results.length} candidates)
</h2>
<div className="space-y-4">
{results.map(candidate => (
<div key={candidate.id} className="border p-4 rounded shadow-sm">
<h3 className="font-bold">{candidate.full_name}</h3>
<p className="text-gray-600">{candidate.current_title}</p>
<div className="mt-2 text-sm text-gray-500">
<p>{candidate.location} • {candidate.years_experience} years experience</p>
<p>{candidate.email}</p>
</div>
</div>
))}
</div>
</div>
</div>
);
}
9) Advanced analytics and reporting
Skill gap analysis - identify missing skills in your talent pool:
-- Define your target skill set and compare against available candidates
WITH desired_skills(skill) AS (
VALUES
('Kubernetes'),
('Docker'),
('AWS'),
('Terraform'),
('Python'),
('Go'),
('React'),
('PostgreSQL')
)
SELECT
d.skill,
COUNT(DISTINCT s.candidate_id) as candidates_with_skill, -- distinct candidates, so duplicate skill rows don't inflate coverage
ROUND(COUNT(DISTINCT s.candidate_id) * 100.0 / (SELECT COUNT(*) FROM candidates), 2) as coverage_percentage,
(SELECT COUNT(*) FROM candidates) - COUNT(DISTINCT s.candidate_id) as candidates_missing_skill
FROM desired_skills d
LEFT JOIN skills s ON LOWER(s.skill_name) = LOWER(d.skill)
GROUP BY d.skill
ORDER BY coverage_percentage DESC;
Experience distribution analysis:
SELECT
CASE
WHEN years_experience < 2 THEN '0-2 years (Junior)'
WHEN years_experience < 5 THEN '2-5 years (Mid-level)'
WHEN years_experience < 10 THEN '5-10 years (Senior)'
ELSE '10+ years (Expert/Lead)'
END as experience_level,
COUNT(*) as candidate_count,
ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM candidates), 2) as percentage
FROM candidates
GROUP BY experience_level
ORDER BY MIN(years_experience);
Top educational institutions:
SELECT
e.institution,
COUNT(DISTINCT e.candidate_id) as candidate_count,
GROUP_CONCAT(DISTINCT e.degree) as degrees_offered,
GROUP_CONCAT(DISTINCT e.field_of_study) as fields
FROM education e
GROUP BY e.institution
ORDER BY candidate_count DESC
LIMIT 20;
Career path analysis - common job progressions:
-- Find candidates who moved from Company A to Company B
SELECT
c.full_name,
c.email,
we1.company as previous_company,
we1.title as previous_title,
we2.company as current_company,
we2.title as current_title
FROM candidates c
JOIN work_experience we1 ON we1.candidate_id = c.id
JOIN work_experience we2 ON we2.candidate_id = c.id
WHERE we1.company = 'Google'
AND we2.company = 'Amazon'
AND we1.start_date < we2.start_date;
Diversity metrics - location distribution:
SELECT
CASE
WHEN location LIKE '%San Francisco%' OR location LIKE '%SF%' THEN 'San Francisco Bay Area'
WHEN location LIKE '%New York%' OR location LIKE '%NYC%' THEN 'New York'
WHEN location LIKE '%Seattle%' THEN 'Seattle'
WHEN location LIKE '%Austin%' THEN 'Austin'
WHEN location LIKE '%Remote%' THEN 'Remote'
ELSE 'Other'
END as location_group,
COUNT(*) as candidate_count
FROM candidates
GROUP BY location_group
ORDER BY candidate_count DESC;
10) Best practices and troubleshooting
Schema design tips:
- Start simple: Begin with core fields, add optional fields after testing with real resumes
- Use child tables for repeating data: Work experience, education, skills, certifications should always be child tables
- Consider JSON arrays for highly variable data: Awards, publications, languages, hobbies work well as JSON if they’re not searchable criteria (a query sketch follows this list)
- Normalize contact info: Store one email/phone per candidate to avoid duplicates
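If you do store a JSON-array column, SQLite's built-in json_each table-valued function still lets you query inside it. A sketch - the languages column here is hypothetical, holding values like '["English", "Spanish"]':

import sqlite3

conn = sqlite3.connect("resumes.sqlite")
# Hypothetical: assumes candidates has a JSON-array column named "languages".
# json_each expands each array element into its own row.
rows = conn.execute("""
    SELECT c.full_name, j.value AS language
    FROM candidates c, json_each(c.languages) j
    WHERE j.value = 'Spanish'
""").fetchall()
for full_name, language in rows:
    print(full_name, language)
conn.close()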
Extraction quality tips:
- Test with diverse formats: Canva templates, LaTeX academic CVs, Word documents, LinkedIn exports
- Handle edge cases:
- Career gaps (unemployed periods)
- Contractor/freelance work (many short-term positions)
- International education (different degree naming conventions)
- Bilingual resumes (choose primary language)
- Validate critical fields (a validation sketch follows this list):
  - Email regex validation: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
  - Phone number normalization: convert all to +1-XXX-XXX-XXXX format
  - Date consistency: enforce YYYY-MM format for all date fields
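A validation pass over the extracted rows might look like this - a sketch; the flagging approach (print for manual review) is an assumption, and note the phone format above is US-centric:

import re
import sqlite3

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
DATE_RE = re.compile(r"^\d{4}-\d{2}$")

conn = sqlite3.connect("resumes.sqlite")

# Flag malformed emails for manual review
for cand_id, email in conn.execute("SELECT id, email FROM candidates"):
    if email and not EMAIL_RE.match(email):
        print(f"candidate {cand_id}: suspicious email {email!r}")

# Flag start dates that are not in YYYY-MM format
for exp_id, start in conn.execute("SELECT id, start_date FROM work_experience"):
    if start and not DATE_RE.match(str(start)):
        print(f"work_experience {exp_id}: non-YYYY-MM start_date {start!r}")

conn.close()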
Privacy and compliance:
- Encrypt at rest: Use SQLCipher, or otherwise encrypt the SQLite file before storing it
- GDPR compliance: Implement data retention policies (delete candidates after N months without activity)
- Redact sensitive info: Before sharing databases externally, remove personal identifiers:
UPDATE candidates
SET email = 'redacted_' || id || '@example.com',
    phone = 'XXX-XXX-' || SUBSTR(phone, -4),
    full_name = 'Candidate ' || id;
- Access controls: Limit who can download the full database; consider role-based exports
Common extraction issues and fixes:
| Issue | Cause | Solution |
|---|---|---|
| Skills as paragraphs instead of individual entries | Vague prompt | Prompt: “Extract each skill separately. Return individual skills (e.g., ‘Python’, ‘AWS’), not full sentences.” |
| Inconsistent date formats | Resumes use various formats | Prompt: “Return dates in YYYY-MM format. For ‘Jan 2020’, return ‘2020-01’.” |
| Missing work descriptions | Some resumes only list titles | Adjust expectation: make description optional, rely on technologies_used for details |
| Duplicate skills with slight variations | “JavaScript” vs “Javascript” vs “JS” | Post-processing: normalize skill names with a mapping table (see the sketch below the table) |
| Years of experience calculation wrong | Model misunderstands overlapping positions | Be explicit: “Sum all work periods, excluding overlaps and internships” |
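A mapping-table normalizer might look like this - a sketch, where the alias list is just a starting point you would extend as you find variants in your own data:

import sqlite3

# Canonical names keyed by lowercase alias - extend as needed
SKILL_ALIASES = {
    "js": "JavaScript",
    "javascript": "JavaScript",
    "reactjs": "React",
    "react.js": "React",
    "amazon web services": "AWS",
    "postgres": "PostgreSQL",
}

conn = sqlite3.connect("resumes.sqlite")
# fetchall() first so we aren't updating the table while iterating over it
for skill_id, name in conn.execute("SELECT id, skill_name FROM skills").fetchall():
    canonical = SKILL_ALIASES.get(name.strip().lower())
    if canonical and canonical != name:
        conn.execute("UPDATE skills SET skill_name = ? WHERE id = ?",
                     (canonical, skill_id))
conn.commit()
conn.close()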
11) Scaling to thousands of resumes
Batch processing workflow:
- Organize by job posting: Create separate PdfParse projects for each role or hiring campaign
- Process in batches: Upload 50-100 resumes at a time to monitor extraction quality
- Review and retry: Check error rates after each batch, refine prompts if >5% fail
- Download incrementally: Download SQLite after each batch, merge databases locally
Merging multiple SQLite databases:
import sqlite3
def merge_resume_databases(output_db: str, input_dbs: list):
"""Merge multiple resume databases into one."""
conn_out = sqlite3.connect(output_db)
cursor_out = conn_out.cursor()
# Create tables in output database
cursor_out.execute('''
CREATE TABLE IF NOT EXISTS candidates (
id INTEGER PRIMARY KEY,
full_name TEXT,
email TEXT,
phone TEXT,
location TEXT,
linkedin_url TEXT,
portfolio_url TEXT,
summary TEXT,
years_experience REAL,
current_title TEXT
)
''')
# Similar CREATE TABLE statements for other tables...
candidate_id_offset = 0
for db_file in input_dbs:
conn_in = sqlite3.connect(db_file)
cursor_in = conn_in.cursor()
# Copy candidates with ID offset
cursor_in.execute('SELECT * FROM candidates')
for row in cursor_in.fetchall():
new_row = (row[0] + candidate_id_offset,) + row[1:]
cursor_out.execute('''
INSERT INTO candidates VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
''', new_row)
# Copy work_experience with updated foreign keys
cursor_in.execute('SELECT * FROM work_experience')
for row in cursor_in.fetchall():
new_row = (None, row[1] + candidate_id_offset) + row[2:]
cursor_out.execute('''
INSERT INTO work_experience (id, candidate_id, company, title, start_date, end_date, description, technologies_used)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)
''', new_row)
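# education and skills rows are copied the same way, offsetting
# candidate_id by candidate_id_offset (omitted here for brevity)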
# Update offset for next database
cursor_in.execute('SELECT MAX(id) FROM candidates')
max_id = cursor_in.fetchone()[0] or 0
candidate_id_offset += max_id
conn_in.close()
conn_out.commit()
conn_out.close()
print(f"Merged {len(input_dbs)} databases into {output_db}")
# Merge multiple job posting databases
merge_resume_databases(
output_db='all_candidates_2025.sqlite',
input_dbs=[
'backend_engineer_q1.sqlite',
'frontend_engineer_q1.sqlite',
'data_scientist_q1.sqlite'
]
)
Performance optimization for large databases:
-- Create indexes for common query patterns
CREATE INDEX idx_candidates_experience ON candidates(years_experience);
CREATE INDEX idx_candidates_location ON candidates(location);
CREATE INDEX idx_skills_name ON skills(skill_name);
CREATE INDEX idx_skills_candidate ON skills(candidate_id);
CREATE INDEX idx_work_company ON work_experience(company);
CREATE INDEX idx_work_candidate ON work_experience(candidate_id);
-- Analyze database for query optimization
ANALYZE;
12) Real-world use cases
Recruiting agency workflow:
- Create separate projects for each client company
- Process incoming resumes daily
- Download and merge weekly into master candidate database
- Run weekly reports: new skills trending, top universities, experience distribution
- Export matches to client ATS systems
Internal HR talent pool:
- Process all applicants from career site submissions
- Tag candidates by application date, source (LinkedIn, referral, career site)
- Quarterly reviews: re-screen previous applicants against new openings
- Build internal referral pipeline by searching for companies employees worked at
Portfolio company recruiting (VC/PE firms):
- Aggregate resumes across all portfolio companies
- Identify candidates who applied to multiple portfolio companies
- Cross-pollinate talent: suggest candidates to portfolio companies based on skill matches
- Track hiring metrics: time-to-hire, source quality, skill demand trends
Get started building your resume database
Ready to transform your recruiting workflow? Here’s how to get started:
- Create a PdfParse account - free tier includes 20 pages to test with sample resumes
- Set up your schema - copy the JSON schema from this article or customize to your needs
- Upload and process - drag and drop resume PDFs, click “Process Files”
- Review and refine - check extraction quality, adjust prompts if needed
- Download SQLite - get your portable database file ready for integration
- Build your workflow - integrate with Python, Excel, BI tools, or custom applications
Need help with your setup? Contact us at support@pdfparse.net for:
- Custom schema design for specialized roles (medical, legal, executive)
- Bulk processing guidance for thousands of resumes
- ATS and HRIS integration support
- Privacy and compliance consulting (GDPR, CCPA)
- API access for automated workflows
Related tutorials
Continue learning about structured document extraction:
- Build an Invoice Extractor with Child Tables and Foreign Keys - Learn the parent-child table pattern for invoices with line items
- Build a Bank Statement Extractor with JSON Arrays - See how to model transaction data using JSON arrays for flexibility
- Introducing PdfParse - Product overview and core features walkthrough