Accessibility Audit

Check your PDFs for accessibility

Upload a PDF to get an instant accessibility score based on WCAG 2.1 and ADA Title II requirements. The audit checks nine categories — text extractability, heading structure, alt text, table markup, and more — and returns a detailed report with actionable findings.

Upload up to 5 PDF files, max 15 MB each.

Technical Details: How This Tool Analyzes & Remediates PDFs

Overview: What This Tool Does

This tool checks whether a PDF document can be read by people who use assistive technology — screen readers, braille displays, and other tools used by people with disabilities. It does this by examining the internal structure of the PDF file, not just its visual appearance. A PDF that looks fine on screen may be completely unreadable to a screen reader if it lacks the right internal markup.

The tool evaluates PDFs against WCAG 2.1 Level AA (the international standard for web content accessibility) and ADA Title II digital accessibility requirements (U.S. federal law requiring state and local government digital content to be accessible, effective April 2026).

What Is a PDF, Really? (And Why It's Different from Word)

To understand why some PDFs are accessible and others aren't — and why "fixing" an inaccessible PDF can be so much harder than it looks — it helps to know what a PDF actually is under the hood. Most people use PDFs every day without ever thinking about it. Here's the short version.

A PDF is an export, not a source document. Adobe created the Portable Document Format in 1993 to solve a specific problem: making a file that looks identical on every printer, every monitor, every operating system. You don't write in a PDF; you write in Word, InDesign, Pages, or Google Docs, and then you export to PDF when you want to share the finished result. PDF is the printed-and-mailed envelope at the end of the workflow, not the word processor you used to draft the letter.

The difference between Word and PDF is about what each format stores:

Word (.docx) says, conceptually:
  <h1>Annual Report 2024</h1>
  <p>In fiscal year 2024…</p>
  <img alt="Bar chart showing arrests by month" src="…" />

PDF says:
  Page 1, x=72, y=720, font=Arial-Bold, size=24pt: glyph 'A'
  Page 1, x=85, y=720, font=Arial-Bold, size=24pt: glyph 'n'
  Page 1, x=98, y=720, font=Arial-Bold, size=24pt: glyph 'n'
  Page 1, x=72, y=680, font=Arial, size=11pt: glyph 'I'
  Page 1, x=78, y=680, font=Arial, size=11pt: glyph 'n'
  Page 1, x=72, y=200, image XObject ref=42 (768 x 432 pixels)
  …

Word stores the meaning of your content. The <h1> tag tells any program reading the file: "this is a top-level heading." The <img> tag has an alt attribute that describes the picture. A screen reader can read a Word file and navigate it like a webpage because the meaning is right there in the file.

PDF stores where every glyph goes on the page. That's it. A PDF doesn't natively know which glyphs are a heading and which are a paragraph — only that this letter is here, that letter is there, in this font, in this color. When you read a PDF, your brain does the work of recognizing "the big bold text at the top must be a heading." A screen reader can't do that from glyph positions alone — it would just read each glyph in sequence, which sounds like gibberish.

So how can a PDF be accessible at all? Starting in 2001 (PDF version 1.4), Adobe added an optional second layer to the format called the structure tree (or "tags"). This is a separate invisible layer that runs alongside the visual content and says "the glyphs that draw 'Annual Report 2024' belong to a <H1> element. The image at x=72, y=200 is a <Figure> element with alt-text 'Bar chart showing arrests by month'." Screen readers read the structure tree first, then jump to the visual content based on what the tree tells them.

A PDF that has this layer is called a "tagged PDF." A PDF without it is "untagged." Whether a PDF gets tagged depends on how it was exported. In Word: File → Save As → PDF → Options → "Document structure tags for accessibility" (checked by default in recent versions, but commonly turned off on older Office installs or "minimum size" exports). In InDesign: File → Export → Adobe PDF (Print) → "Create Tagged PDF". Pages and Google Docs are similar. If that box is unchecked, you get an untagged PDF — visually identical, but invisible to screen readers.

The structure tree itself looks like a webpage's DOM tree, because it borrows the same ideas:

StructTreeRoot
└── Document
    ├── H1 "Annual Report 2024"
    ├── P "In fiscal year 2024, the agency processed…"
    ├── Figure (/Alt "Bar chart showing arrests by month")
    ├── H2 "Methodology"
    ├── P "Data was collected from…"
    └── Table
        ├── TR
        │   ├── TH (Scope=Col) "County"
        │   ├── TH (Scope=Col) "Arrests"
        │   └── TH (Scope=Col) "Year"
        └── TR
            ├── TD "Cook"
            ├── TD "12,345"
            └── TD "2024"

Standard tag types include Document, Sect, H1 through H6, P, L / LI (list / list item), Table / TR / TH / TD, Figure, Caption, Form, Link, and Artifact (used for purely decorative content that screen readers should skip). Each can carry attributes like /Alt (alt text for figures), /Lang (language declaration), and Scope (whether a TH is a row or column header).

Linking these tags back to the glyphs they describe uses Marked Content Identifiers (MCIDs). Each chunk of content in the page's drawing instructions is wrapped in a marked-content sequence (a BDC … EMC operator pair carrying an /MCID entry, such as /MCID 7), and the corresponding structure tree node points back at that identifier. It's the same idea as id attributes connecting HTML elements to JavaScript handlers: a separate identifier layer that knits two parallel representations together.

This architecture is why retrofitting accessibility into an existing PDF is so much harder than getting it right at export. When Word exports a tagged PDF, it already knows your headings are headings — it just copies that semantic information into the structure tree. When somebody hands you an untagged PDF and asks you to fix it, the only thing left is the glyph positions. Reverse-engineering "what was this heading?" from "14-pt bold text at the top of page 2" is what auto-remediation tools attempt, but with the same fundamental limitation a human would have: it's a guess based on visual cues, not a recall of authorial intent.

The practical takeaway: the most reliable path to an accessible PDF is to fix accessibility issues in the source document (Word, InDesign, etc.) and re-export with tagging enabled. The next-best path — and what this tool's optional auto-remediation feature does — is to take an already-exported PDF and add structure tags after the fact. The audit results page surfaces this distinction in the "Best path to accessibility starts at the source" notice.

How It Works

When you upload a PDF, the server runs two independent, open-source analysis tools in parallel — one reads the PDF's internal structure (tags, bookmarks, form fields), the other extracts text and metadata from every page. Their combined output feeds a scorer that evaluates nine accessibility categories and produces a weighted overall score. No data is sent to third-party services or AI models — all processing happens on the server (hosted on DigitalOcean cloud infrastructure). The uploaded PDF is deleted immediately after analysis — no PDF content is retained on the server.

PDF → [validate file type & size] → parallel { QPDF (structure), PDF.js (content) } → Scorer (9 categories) → Weighted Score → Report
Audit pipeline — visual flow
flowchart TD
    A[Browser uploads PDF] --> B[Magic-byte + size check]
    B --> C[Hold in memory only]
    C --> D[qpdf analyzes structure]
    C --> E[pdfjs extracts content]
    D --> F[Combined results]
    E --> F
    F --> G[Scorer applies 9 categories]
    G --> H[A-F grade + findings]
    H --> I[Return to browser]

Browser uploads PDF; the server validates magic bytes and size, holds the file in memory, runs qpdf and pdfjs in parallel against it, combines the results, scores nine categories, returns the grade and findings to the browser, and discards the memory buffer.
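The validation step at the front of this pipeline can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `validateUpload` and its result shape are hypothetical, and treating "15 MB" as 15 × 1024 × 1024 bytes is an assumption. The `%PDF-` signature is the standard start of a PDF file.

```typescript
// Hypothetical sketch of the magic-byte and size check; all names are illustrative.
const MAX_UPLOAD_BYTES = 15 * 1024 * 1024; // assumed byte value of the 15 MB upload limit
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // ASCII "%PDF-"

type UploadCheck = { ok: true } | { ok: false; reason: string };

function validateUpload(bytes: Uint8Array): UploadCheck {
  if (bytes.length > MAX_UPLOAD_BYTES) {
    return { ok: false, reason: "file exceeds the size limit" };
  }
  // Compare the first five bytes against the PDF signature.
  const hasMagic = PDF_MAGIC.every((b, i) => bytes[i] === b);
  return hasMagic ? { ok: true } : { ok: false, reason: "missing %PDF- signature" };
}
```

Checking the leading bytes rather than the file extension means a renamed non-PDF file is rejected before either analyzer runs.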

Application Architecture

The application is a monorepo with two components, both running on the same DigitalOcean droplet:

Frontend (port 5102)

A Nuxt 4 (Vue 3) web application that provides the user interface — the upload form, progress indicators, score cards, export buttons, and shareable report pages. Styled with Tailwind CSS and Nuxt UI. Served as a server-rendered app via Nitro.

Backend API (port 5103)

An Express (Node.js/TypeScript) server that handles file uploads, runs QPDF and PDF.js analysis, scores the results, manages authentication (passwordless OTP via email), and stores shared reports in a SQLite database (WAL mode). Managed by PM2 in production.

Both processes are managed by PM2 behind an nginx reverse proxy on a single DigitalOcean droplet provisioned via Laravel Forge. The frontend proxies API requests to the backend — the user's browser never communicates directly with the API server.

Application architecture
flowchart TD
    A[Browser] --> B[Nginx reverse proxy]
    B --> C[Nuxt web app on port 5102]
    B --> D[Express API on port 5103]
    C --> D
    D --> E[qpdf binary]
    D --> F[OpenDataLoader Java]
    D --> G[veraPDF Java]
    D --> H[SQLite database]

Browser talks to Nginx reverse proxy. Nginx routes to either the Nuxt web app (port 5102) or the Express API (port 5103). The web app makes some API calls back to Express. Express shells out to qpdf, OpenDataLoader Java, and veraPDF Java; it reads and writes SQLite locally. No external services.

Tool 1: QPDF (PDF Structure Extraction)

QPDF is an open-source C++ command-line program for inspecting and transforming PDF files. It is maintained by Jay Berkenbilt and is widely used in PDF archival libraries, digital preservation projects, and accessibility workflows. Think of QPDF as a tool that can "open up" a PDF and read its internal blueprint — not just the words on the page, but the hidden structural information that tells assistive technology how the document is organized.

How it's called: The server invokes QPDF as a subprocess with the --json flag, which outputs the PDF's complete internal object graph as machine-readable JSON. The server writes the uploaded PDF to a temporary file, runs qpdf --json /tmp/<uuid>.pdf, parses the resulting JSON, and immediately deletes the temp file. The subprocess has a 30-second timeout and a 50 MB output buffer to handle complex documents safely.
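A subprocess call with these limits might look like the sketch below. It assumes Node's built-in `child_process.execFile` is used (the text does not say which API the server calls), and the names `QPDF_EXEC_OPTIONS` and `runQpdfJson` are hypothetical. Temp-file creation and cleanup are left out.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Limits taken from the description above; reading "50 MB" as 50 * 1024 * 1024 is an assumption.
const QPDF_EXEC_OPTIONS = {
  timeout: 30_000,             // 30-second subprocess timeout, in milliseconds
  maxBuffer: 50 * 1024 * 1024, // cap on captured stdout, in bytes
};

// Runs `qpdf --json <path>` and parses the JSON printed to stdout.
async function runQpdfJson(pdfPath: string): Promise<unknown> {
  const { stdout } = await execFileAsync("qpdf", ["--json", pdfPath], QPDF_EXEC_OPTIONS);
  return JSON.parse(String(stdout));
}
```

Passing the path as an argument array, rather than building a shell string, avoids shell interpolation of the temp-file name.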

Why QPDF? A PDF file is not a simple document. Internally, it is a collection of numbered "objects" (text streams, images, fonts, bookmarks, form fields, tags) connected by cross-references. QPDF can decode and dump this entire object graph as structured data, which lets the tool inspect every accessibility-relevant feature without relying on visual rendering. Few other open-source tools expose the object graph this directly.

What QPDF extracts

| Data | PDF Source | Used For |
| --- | --- | --- |
| StructTreeRoot | Catalog /StructTreeRoot | Whether the PDF is "tagged" (has a semantic structure tree) |
| Language declaration | Catalog /Lang | Language accessibility (screen reader pronunciation) |
| Headings (H1–H6) | Structure elements with /S = /H, /H1 through /H6 | Heading presence, hierarchy validation, level-skip detection |
| Outlines / Bookmarks | /Outlines → /First → /Next chain | Bookmark count for navigation scoring |
| Tables & structure | Structure elements /Table, /TR, /TH, /TD, /Caption, /Scope, /Headers | Header cells, scope attributes, row structure, nesting, captions, column consistency, header-data associations |
| Images & figures | XObjects (/Image) + structure elements (/Figure with /Alt) | Image detection and alt text presence |
| Form fields | Widget annotations + /AcroForm/Fields + /TU tooltip | Whether form fields have accessible labels |
| Reading order MCIDs | Numeric /K values (Marked Content IDs) in structure tree | Content sequence validation; detects out-of-order reading |
| Lists | Structure elements /L, /LI, /Lbl, /LBody | List detection, well-formedness (label + body per item), nesting depth |
| Paragraphs | Structure elements with /S = /P | Text organization: whether body text is structurally tagged |
| MarkInfo & artifacts | Catalog /MarkInfo/Marked | Whether content is distinguished from artifacts (headers, footers, watermarks) |
| Role mapping | /RoleMap on Catalog or StructTreeRoot | Custom tag mappings to standard PDF roles (e.g., Title → H1) |
| Tab order | Page objects /Tabs | Whether keyboard navigation follows the structure tree |
| Font embedding | FontDescriptor /FontFile, /FontFile2, /FontFile3 | Whether fonts are embedded (non-embedded fonts can cause garbled text) |
| Language spans | Structure elements with their own /Lang | Inline language declarations for foreign-language content |
| PDF/UA identifier | XMP metadata stream (pdfuaid:part) | Whether the document claims PDF/UA (ISO 14289) accessibility conformance |
| Artifact elements | Structure elements with /S = /Artifact | Decorative content (headers, footers, watermarks) distinguished from real content |
| ActualText & expansion | /ActualText and /E on structure elements | Screen reader text overrides for ligatures, symbols, and abbreviation expansions |

Tool 2: PDF.js (Content & Metadata Extraction)

PDF.js is Mozilla's open-source JavaScript PDF renderer — the same library that powers Firefox's built-in PDF viewer, used by hundreds of millions of people. While QPDF reads the internal blueprint, PDF.js reads the PDF the way a human would: it renders each page and extracts the actual text content, metadata (title, author, language), and interactive elements like links. It runs server-side via Node.js, processing every page of the uploaded document.

What PDF.js extracts

| Data | Method | Used For |
| --- | --- | --- |
| Text content | page.getTextContent() per page | Text extractability (minimum 50 chars = "has text") |
| Title, Author, Language | doc.getMetadata() | Title/language scoring (filename-like titles are rejected) |
| Links & link text | page.getAnnotations() + spatial text matching | Link quality: detects raw URLs vs. descriptive text |
| Image count (approx.) | page.getOperatorList() + image object resolution | Fallback image detection when QPDF finds no tagged images. Deduplicates per page and filters out images smaller than 50px (spacers, borders). Count is approximate and may include decorative graphics. |
| Outlines | doc.getOutline() | Bookmark detection (cross-referenced with QPDF) |
| Empty pages | Per-page text length < 10 chars | Detects blank pages or pages with content only as images (may need OCR) |

Link text extraction uses a spatial matching algorithm: for each link annotation, PDF.js finds text items whose coordinates fall within the link's bounding rectangle (±5px tolerance), then joins them to determine the visible link text. This is how the tool distinguishes descriptive links ("View the full report") from raw URLs ("https://example.com/report.pdf").
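The spatial match described above can be sketched as a simple point-in-rectangle filter. The `Rect` and `TextItem` shapes are hypothetical simplifications (real PDF.js annotations and text items carry more fields), and the sketch assumes each text item has already been reduced to a single origin point and that matched items are joined with spaces.

```typescript
// Hypothetical, simplified shapes; not the PDF.js types themselves.
interface Rect { x1: number; y1: number; x2: number; y2: number }
interface TextItem { str: string; x: number; y: number }

const TOLERANCE = 5; // the ±5 tolerance described above

// Returns the visible text whose origin falls inside the link rectangle (with tolerance).
function linkText(link: Rect, items: TextItem[]): string {
  return items
    .filter(
      (t) =>
        t.x >= link.x1 - TOLERANCE &&
        t.x <= link.x2 + TOLERANCE &&
        t.y >= link.y1 - TOLERANCE &&
        t.y <= link.y2 + TOLERANCE,
    )
    .map((t) => t.str)
    .join(" ")
    .trim();
}
```

The tolerance matters because text origins often sit a few units outside the annotation rectangle, for example when the link box is drawn tightly around the glyphs.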

Why Two Tools?

No single open-source library can extract both the low-level PDF structure (tag trees, object references, XObjects) and the rendered text content. Each tool sees a different layer of the document:

QPDF sees:

Structure tags, heading hierarchy, table markup, image objects, form field labels, bookmark chains, reading order markers — the "skeleton" of the document.

PDF.js sees:

Rendered text content, document title and metadata, link URLs and their visible text, page count, image rendering operations — the "surface" of the document as a user would read it.

By cross-referencing both outputs, the scorer can answer questions that neither tool could answer alone. For example: "Does this image have alt text?" requires QPDF to find the image object and its Figure tag, while "Is there any readable text on this page at all?" requires PDF.js to attempt text extraction. Running both tools in parallel hides their individual processing time.

Two-tool parallel analysis
flowchart TD
    A[Uploaded PDF buffer] --> B[Run in parallel]
    B --> C[qpdf: structure tree, language, outlines, images, tables]
    B --> D[pdfjs: text, metadata, content order, per-page details]
    C --> E[QpdfResult]
    D --> F[PdfjsResult]
    E --> G[Scorer]
    F --> G
    G --> H[Weighted score across 9 categories]

The uploaded buffer runs through qpdf (structure tree, language, outlines, images, tables) and pdfjs (text, metadata, content order) in parallel. Their results combine in the scorer for a weighted score across 9 categories.

How Scores Are Calculated

The scorer evaluates up to eleven accessibility categories. Strict weighs nine of them (anchored to WCAG 2.1 AA and IITAA §E205.4) and does not include a PDF/UA category. Practical uses different category weights and additionally weights a PDF/UA Compliance Signals category and, as the weight table shows, a Color Contrast category. Both methodologies evaluate the same document using WCAG guidelines. Each category receives a score from 0 to 100 (or N/A if the category doesn't apply to the document). The overall score is a weighted average of applicable categories, with weights renormalized to exclude N/A categories.

| Category | Strict weight (WCAG + IITAA §E205.4) | Practical weight (WCAG + PDF/UA) |
| --- | --- | --- |
| Text Extractability | 20% | 17.5% |
| Title & Language | 15% | 13% |
| Heading Structure | 15% | 13% |
| Alt Text on Images | 15% | 13% |
| PDF/UA Compliance Signals (Practical only) | N/A | 9.5% |
| Bookmarks / Navigation | 10% | 8.5% |
| Table Markup | 10% | 8.5% |
| Color Contrast | N/A | 4.5% |
| Link Quality | 5% | 4.5% |
| Reading Order | 5% | 4% |
| Form Accessibility | 5% | 4% |
| Total | 100% | 100% |

Two scoring methodologies, one document

Both Strict and Practical correctly evaluate the same document using WCAG guidelines. They differ in two ways: category weights, and whether the PDF/UA Compliance Signals category is scored. Both are valid evaluations — neither is “right.”

Strict weighs nine categories anchored to WCAG 2.1 AA and IITAA §E205.4. It does not include a PDF/UA category and emphasizes programmatically determinable structure — real headings, real table-header relationships, and logical reading order.

Practical uses different category weights than Strict and adds a dedicated PDF/UA Compliance Signals category (MarkInfo, tab order, list/table legality, PDF/UA identifiers). It also applies partial-credit floors on heading and table structure. These floors and weights are judgment calls built into this tool, not published standards.

How the two scores relate. Strict is the canonical score. It aligns with WCAG 2.1 Level AA, ADA Title II, and Illinois IITAA §E205.4, the rules that actually govern non-web document accessibility in Illinois. Practical adds a PDF/UA layer (ISO 14289-1) on top of Strict. PDF/UA is not a legal requirement for final documents under Illinois rules: IITAA references PDF/UA only in §504.2.2, and only for authoring-tool export capability, not for the PDF artifact itself. Groups like DoIT that want the WCAG / IITAA / ADA picture without PDF/UA noise should cite the Strict score.

Strict ≤ Practical, always. Practical starts from the same document evidence and can add points for remediation scaffolding that Strict deliberately ignores (70-point partial-credit floors on heading and table structure) and for PDF/UA signals (MarkInfo, tab order, PDF/UA identifiers). When none of those bonuses apply, Practical equals Strict. Practical can never drop below Strict — the scorer guards that invariant explicitly.

Illinois IITAA 2.1 references PDF/UA in §504.2.2 PDF Export for authoring-tool export capability, while §E205.4 frames final non-web document accessibility through WCAG 2.1. Neither profile is a final legal determination.

Category scoring logic

Text Extractability (20% weight — highest)

What it means: Can a screen reader actually read the words in this PDF? Some PDFs are just pictures of text (scanned documents) — they look normal on screen but are completely invisible to assistive technology.

How it's scored: 100 = extractable text + structure tags + all fonts embedded. 85 (cap) = any non-embedded fonts detected (prevents Pass — non-embedded fonts can cause garbled screen reader output). 50 = text is present but no tags (an untagged PDF). 25 = tags are present but no extractable text (partially remediated scan). 0 = no text and no tags (unremediated scanned image). This category carries the highest weight because if text can't be extracted, nothing else matters.
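The tiers above reduce to a small decision function. This is a sketch under the assumption that the three inputs are already known as booleans; the function and parameter names are hypothetical.

```typescript
// Sketch of the Text Extractability tiers; names are illustrative.
function scoreTextExtractability(
  hasText: boolean,
  hasTags: boolean,
  allFontsEmbedded: boolean,
): number {
  let base: number;
  if (hasText && hasTags) base = 100;
  else if (hasText) base = 50; // text present, untagged
  else if (hasTags) base = 25; // tags present, no extractable text
  else base = 0;               // unremediated scan
  // Non-embedded fonts cap the category at 85.
  return allFontsEmbedded ? base : Math.min(base, 85);
}
```

Expressing the font rule as a cap rather than a separate tier keeps the lower tiers unchanged: an untagged PDF with non-embedded fonts still scores 50.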

Title & Language (15%)

What it means: The document title is the first thing a screen reader announces when a user opens the PDF. The language tag controls how the screen reader pronounces words — without it, an English document might be read with a French accent, making it incomprehensible.

How it's scored: 50 points for a meaningful document title (filenames like "report_final.pdf" are automatically rejected as non-meaningful), plus 50 points for a declared language tag. Both are checked in QPDF's catalog /Lang and PDF.js metadata.

Heading Structure (15%)

What it means: Headings (H1, H2, H3, etc.) are how screen reader users navigate and skim documents — the same way sighted users scan bold section titles. Without headings, a blind user must listen to the entire document from start to finish to find the section they need.

How it's scored: 100 = H1–H6 tags present with logical hierarchy (no level skips, exactly one H1). 75 = multiple H1 headings (a document should have exactly one H1 for the title). 60 = numbered headings present but hierarchy is broken (e.g., jumps from H1 to H3 with no H2). 55 = both multiple H1s and hierarchy gaps. 40 = only generic /H tags (not properly numbered H1–H6). 0 = no heading tags at all.
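A sketch of these heading tiers follows. It takes the numbered heading levels in document order plus a count of generic /H tags; both parameter names are hypothetical. One case is not described above, a document whose first heading is deeper than H1, and this sketch assumes it counts as a level skip.

```typescript
// Sketch of the Heading Structure tiers; input shape and names are illustrative.
function scoreHeadingStructure(numberedLevels: number[], genericHeadingCount: number): number {
  if (numberedLevels.length === 0) {
    return genericHeadingCount > 0 ? 40 : 0; // only generic /H tags, or none at all
  }
  const multipleH1 = numberedLevels.filter((level) => level === 1).length > 1;

  // A skip is any step down the hierarchy of more than one level (e.g. H1 to H3).
  // Starting from 0 treats "first heading is H2 or deeper" as a skip (an assumption).
  let hasSkip = false;
  let previous = 0;
  for (const level of numberedLevels) {
    if (level - previous > 1) hasSkip = true;
    previous = level;
  }

  if (multipleH1 && hasSkip) return 55;
  if (hasSkip) return 60;
  if (multipleH1) return 75;
  return 100;
}
```

Moving back up the hierarchy (H3 followed by H2) is not a skip; only jumps to a deeper level are checked.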

Alt Text on Images (15%)

What it means: Every informative image in a PDF must have "alternative text" — a short description that a screen reader reads aloud. Without alt text, a blind user hears nothing when they encounter a chart, photo, or diagram.

How it's scored: The percentage of detected images that have alt text. QPDF identifies image objects (/Image XObjects) and matches them to their /Figure structure elements, then checks whether each Figure has an /Alt attribute. If QPDF finds no tagged images, PDF.js provides a fallback by counting image rendering operations — if images exist but aren't tagged, the category scores 0 (Critical) instead of N/A. N/A only if no images are detected by either tool.
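The percentage and its fallback can be sketched as below. The inputs are hypothetical: one boolean per tagged Figure (whether it has /Alt) and the approximate image count from the PDF.js fallback. Rounding to a whole number is an assumption; `null` stands for N/A.

```typescript
// Sketch of the Alt Text score; names and rounding are illustrative.
function scoreAltText(figureHasAlt: boolean[], fallbackImageCount: number): number | null {
  if (figureHasAlt.length === 0) {
    // No tagged figures: images seen only by the fallback score 0, none at all is N/A.
    return fallbackImageCount > 0 ? 0 : null;
  }
  const withAlt = figureHasAlt.filter(Boolean).length;
  return Math.round((withAlt / figureHasAlt.length) * 100);
}
```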

PDF/UA Compliance Signals (Practical only, 9.5%)

What it means: A family of PDF/UA-oriented structural signals (tagging, MarkInfo, tab order, PDF/UA identifiers, list/table legality) that some remediation vendors and PAC-style tools weight explicitly. This category is scored in Practical and not in Strict — Strict does not include a PDF/UA category. IITAA §504.2.2 references PDF/UA for authoring-tool export capability, while §E205.4 frames final-document accessibility through WCAG 2.1. Strict therefore surfaces this category as N/A with guidance; Practical includes it in its weighted average.

How it's scored: 0 if the document has no StructTreeRoot (untagged). Otherwise the score starts at 25 for a tagged document and accumulates: MarkInfo /Marked true (+20) or present-only (+10), PDF/UA identifier in metadata (+15), tab order on every page (+10) or some pages (+5), list legality up to +15 based on <Lbl>/<LBody> well-formedness, and table legality up to +15 from row structure, consistent columns, and no nested tables. The numbers here are the original developer's judgment calls, not a published standard, and the total is a readiness signal — not a PDF/UA conformance verdict. PAC and Matterhorn remain the formal conformance checks.
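The accumulation above can be sketched as follows. The `PdfUaSignals` shape is hypothetical, and so is the way the "up to +15" list and table components are apportioned: this sketch assumes each arrives as a 0 to 1 fraction that is scaled to 15 points.

```typescript
// Hypothetical input shape; the fraction-based legality inputs are an assumption.
interface PdfUaSignals {
  tagged: boolean;
  markInfo: "marked" | "present" | "absent";
  hasPdfUaIdentifier: boolean;
  tabOrder: "all" | "some" | "none";
  listLegality: number;  // 0..1
  tableLegality: number; // 0..1
}

const clamp01 = (n: number): number => Math.min(1, Math.max(0, n));

function scorePdfUaSignals(s: PdfUaSignals): number {
  if (!s.tagged) return 0; // no StructTreeRoot
  let score = 25;          // base for a tagged document
  score += s.markInfo === "marked" ? 20 : s.markInfo === "present" ? 10 : 0;
  if (s.hasPdfUaIdentifier) score += 15;
  score += s.tabOrder === "all" ? 10 : s.tabOrder === "some" ? 5 : 0;
  score += Math.round(15 * clamp01(s.listLegality));
  score += Math.round(15 * clamp01(s.tableLegality));
  return Math.min(100, score);
}
```

With every signal present the components sum to exactly 100 (25 + 20 + 15 + 10 + 15 + 15), so the final `Math.min` is only a safety net.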

Bookmarks / Navigation (10%)

What it means: Bookmarks act as a clickable table of contents in the PDF viewer's sidebar. For longer documents, they're essential for all users — and required by ADA Title II for documents over a certain length.

How it's scored: N/A for documents under 10 pages (short documents don't require bookmarks). For longer documents: 100 = outline entries present and populated. 25 = outline structure exists but is empty. 0 = no outlines at all. Checked in both QPDF's /Outlines object chain and PDF.js's getOutline().
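As a sketch, the bookmark rule is a page-count gate followed by three outline states. The `OutlineState` type and function name are hypothetical; `null` stands for N/A.

```typescript
// Sketch of the Bookmarks / Navigation tiers; names are illustrative.
type OutlineState = "populated" | "empty" | "none";

function scoreBookmarks(pageCount: number, outline: OutlineState): number | null {
  if (pageCount < 10) return null; // short documents: N/A
  if (outline === "populated") return 100;
  return outline === "empty" ? 25 : 0;
}
```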

Table Markup (10%)

What it means: When a sighted user looks at a data table, they can glance at the column headers to understand what each number means. Screen readers need explicit markup to provide the same context — without it, a screen reader reads a flat stream of numbers with no structure. This category checks seven aspects of table accessibility.

How it's scored: N/A if no tables are detected. Seven sub-checks contribute to the score: Header cells (40 pts) — /TH tags present on header cells (most critical). Row structure (20 pts) — cells are grouped in /TR rows. Scope attributes (10 pts) — each /TH has a /Scope (/Column or /Row) so screen readers know which axis the header applies to. No nested tables (10 pts) — nested tables confuse screen reader navigation. Column consistency (10 pts) — all rows have the same number of cells. Caption (5 pts) — /Caption element describes the table's purpose. Header associations (5 pts) — explicit /Headers attributes on data cells for complex table navigation.
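The seven sub-checks sum to 100 points, which the sketch below makes explicit. It assumes each sub-check is a simple pass/fail for the whole document; the text does not say whether partial credit exists within a sub-check. The `TableChecks` shape is hypothetical, and `null` stands for N/A.

```typescript
// Hypothetical pass/fail results for the seven table sub-checks.
interface TableChecks {
  headerCells: boolean;
  rowStructure: boolean;
  scopeAttributes: boolean;
  noNestedTables: boolean;
  columnConsistency: boolean;
  caption: boolean;
  headerAssociations: boolean;
}

// Point values as listed above; they total 100.
const TABLE_POINTS: Record<keyof TableChecks, number> = {
  headerCells: 40,
  rowStructure: 20,
  scopeAttributes: 10,
  noNestedTables: 10,
  columnConsistency: 10,
  caption: 5,
  headerAssociations: 5,
};

function scoreTableMarkup(checks: TableChecks | null): number | null {
  if (checks === null) return null; // no tables detected
  let score = 0;
  for (const key of Object.keys(TABLE_POINTS) as (keyof TableChecks)[]) {
    if (checks[key]) score += TABLE_POINTS[key];
  }
  return score;
}
```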

Link Quality (5%)

What it means: Screen reader users often navigate by tabbing through links. Hearing "https://www.example.com/documents/2024/report-final-v3.pdf" read aloud character by character is unusable. Descriptive link text like "Download the 2024 Annual Report" tells users where the link goes.

How it's scored: N/A if no links. Percentage of links with descriptive text. A link is flagged as non-descriptive if its visible text starts with http://, https://, or www.. PDF.js extracts the visible text overlapping each link annotation using spatial coordinate matching.
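The prefix rule above is easy to express directly. In this sketch the function names are hypothetical, the case-insensitive comparison and whole-number rounding are assumptions, and cases the text does not cover (such as empty link text) are simply treated by the same prefix rule.

```typescript
// A link is non-descriptive if its visible text starts with a URL prefix.
function isDescriptiveLinkText(text: string): boolean {
  const t = text.trim().toLowerCase(); // case-insensitivity is an assumption
  return !(t.startsWith("http://") || t.startsWith("https://") || t.startsWith("www."));
}

// Percentage of links with descriptive text; null stands for N/A.
function scoreLinkQuality(linkTexts: string[]): number | null {
  if (linkTexts.length === 0) return null;
  const descriptive = linkTexts.filter(isDescriptiveLinkText).length;
  return Math.round((descriptive / linkTexts.length) * 100);
}
```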

Form Accessibility (5%)

What it means: If a PDF contains fillable form fields (text boxes, checkboxes, dropdowns), each field needs a label that assistive technology can read. Without labels, a screen reader user hears "edit text" or "checkbox" with no indication of what the field is for.

How it's scored: N/A if no form fields. Percentage of widget annotations (form fields) that have a /TU (tooltip) attribute, which serves as the accessible label. QPDF checks both the widget annotation and the /AcroForm fields array.

Reading Order (5%)

What it means: PDFs with multi-column layouts, sidebars, or callout boxes can confuse screen readers if the reading order isn't explicitly defined. A sighted user can see that a sidebar is separate from the main text, but a screen reader reads content in the order defined by the structure tree — if that order is wrong, the document becomes a jumble of unrelated sentences.

How it's scored: 100 = structure tree has depth >1 and fewer than 20% of Marked Content IDs (MCIDs) are out of sequence. 50 = more than 20% of MCIDs are out of order. 30 = structure tree is flat (depth ≤1, indicating minimal structure). 0 = no structure tree at all. MCIDs are numeric identifiers that link content on each page to its position in the tag tree; when they're out of order relative to the page content stream, it indicates a reading order problem.
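A sketch of these tiers is below. How an MCID is judged "out of sequence" is not spelled out above, so this sketch assumes a simple definition: an MCID is out of sequence when it is smaller than the one visited just before it in structure-tree order, for a single page's MCID list. The exact-20% boundary is also unspecified and is treated here as passing. All names are hypothetical.

```typescript
// treeDepth: null means no structure tree. pageMcids: one page's MCIDs in structure-tree order.
function scoreReadingOrder(treeDepth: number | null, pageMcids: number[]): number {
  if (treeDepth === null) return 0; // no structure tree at all
  if (treeDepth <= 1) return 30;    // flat tree, minimal structure

  // Count MCIDs that step backwards relative to their predecessor (assumed definition).
  let outOfSequence = 0;
  let previous = Number.NEGATIVE_INFINITY;
  for (const id of pageMcids) {
    if (id < previous) outOfSequence++;
    previous = id;
  }
  const fraction = pageMcids.length === 0 ? 0 : outOfSequence / pageMcids.length;
  return fraction > 0.2 ? 50 : 100;
}
```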

Supplementary analysis

Beyond the nine scored categories, the tool appends supplementary findings to relevant categories. Most are informational only, but some (marked below) do affect scoring. These provide deeper insight into the document's accessibility posture.

| Check | Appended To | What It Reports |
| --- | --- | --- |
| List structure | Reading Order | Per-list breakdown of <LI>, <Lbl>, <LBody> presence and nesting depth |
| Marked content & artifacts | Text Extractability | /MarkInfo status, paragraph tag count, empty page detection |
| Font embedding | Text Extractability | Per-font embedded/not-embedded listing. Scored: non-embedded fonts cap the category at 85 (Minor) |
| Role mapping & tab order | Reading Order | Custom tag role mappings, per-page tab order configuration |
| Language spans | Title & Language | Inline language declarations for foreign-language content within the document |
| Alt text quality | Alt Text on Images | Heuristic check for non-human-readable alt text: hex-encoded data, filenames, generic placeholders, long strings without spaces |
| PDF/UA identifier | Text Extractability | Checks XMP metadata for pdfuaid:part, which indicates whether the document claims PDF/UA (ISO 14289) conformance |
| Artifact tagging | Text Extractability | Counts /Artifact structure elements; headers, footers, and watermarks should be tagged as artifacts so screen readers skip them |
| ActualText & expansion | Reading Order | /ActualText for glyph/ligature overrides and /E for abbreviation expansions; these help screen readers pronounce content correctly |
| Acrobat remediation guide | All categories | When a category scores below "Pass", appends the exact Adobe Acrobat Full Check rule names, menu paths, and step-by-step fix instructions specific to that category |

Weight Renormalization

When a category scores N/A (e.g., a text-only document has no images, tables, links, or forms), its weight is redistributed proportionally to the remaining categories. For example, if Alt Text (15%), Table Markup (10%), and Form Accessibility (5%) are all N/A, the remaining 70% of weights are renormalized to sum to 100%, so Text Extractability's 20% becomes 20/70 ≈ 28.6% of the overall score. This ensures documents are only scored on criteria that actually apply to them.

This renormalization is useful because it prevents a text-only file from being unfairly penalized for lacking tables, images, links, or forms that are not present. But it does not make the remaining categories less important. A higher normalized score, especially in Practical mode, can still coexist with unresolved semantic issues that matter for ADA/WCAG/IITAA review. For Illinois agency publication decisions, normalization is best treated as a scoring convenience, not as a substitute for the stricter category findings.
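Weight renormalization amounts to dividing by the sum of the applicable weights instead of by 100. The sketch below shows that arithmetic; the function name, the category keys, and the use of `null` for N/A are illustrative choices, not the tool's actual data model.

```typescript
type CategoryScore = number | null; // null stands for N/A

// Weighted average over applicable categories only; N/A weights drop out of the denominator.
function weightedOverall(
  weights: Record<string, number>,
  scores: Record<string, CategoryScore>,
): number | null {
  let weightSum = 0;
  let weightedTotal = 0;
  for (const [name, weight] of Object.entries(weights)) {
    const score = scores[name];
    if (score === null || score === undefined) continue;
    weightSum += weight;
    weightedTotal += weight * score;
  }
  return weightSum === 0 ? null : weightedTotal / weightSum;
}

// Strict weights from the table above (keys are hypothetical).
const STRICT_WEIGHTS: Record<string, number> = {
  text: 20, title: 15, headings: 15, altText: 15, bookmarks: 10,
  tables: 10, links: 5, readingOrder: 5, forms: 5,
};
```

For a text-only document with Alt Text, Table Markup, and Form Accessibility at N/A, the denominator is 70 rather than 100, which matches the example in the text.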

Scanned Document Detection

A PDF is flagged as a scanned image when both conditions are true: PDF.js extracts fewer than 50 characters of text content (indicating no real text layer) and QPDF finds no StructTreeRoot (indicating no semantic tags). This combination means the document is an unremediated scanned image that screen readers cannot access at all.
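The two-condition test can be written as a single predicate. The function and parameter names are hypothetical; the 50-character threshold is the one stated above.

```typescript
const MIN_TEXT_CHARS = 50; // below this, the document is treated as having no real text layer

// True only when both signals agree: almost no extractable text AND no structure tree.
function isUnremediatedScan(extractedCharCount: number, hasStructTreeRoot: boolean): boolean {
  return extractedCharCount < MIN_TEXT_CHARS && !hasStructTreeRoot;
}
```

Requiring both conditions keeps a tagged but image-only PDF (a partially remediated scan) out of this bucket.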

PDF Auto-Remediation: Pipeline Overview

As of v1.18.0, the tool also exposes an optional PDF auto-remediation feature behind the REMEDIATION_ENABLED=true env flag. When enabled, the audit results page surfaces an Auto-Remediate this PDF button next to the score. Clicking it spawns a detached worker that runs a four-stage pipeline, validates the output, and either serves the remediated file to the user (single-use download, deleted on stream close) or rejects it and surfaces a fallback message. The user re-uploads to remediate; no PDF is cached between the audit and remediation stages.

POST /api/remediate (multipart PDF)
  → [magic-byte check]
  → [page count cap (500)]
  → [pre-flight audit]
  → [job row created, sha256 content_hash recorded]
  → [spawn detached child: tsx src/jobs/remediate.ts <jobId>]
  ◄ { jobId, downloadToken } (HTTP 202)

Worker pipeline:
  [Stage 1: preparing]  qpdf --object-streams=disable input → normalized
  [Stage 2: tagging]    OpenDataLoader convert(normalized) → tagged-pdf
  [Stage 3: validating] qpdf --check tagged → validity verdict
                        verapdf --flavour ua1 --format json tagged → conformance verdict
  [Stage 4: comparing]  re-audit tagged → output_audit
                        guard: reject if Overall|Strict|Practical regress

Output finalized OR job marked failed. Scratch dir wiped in `finally`.
Remediation pipeline — visual flow
flowchart TD
    A[User clicks Remediate] --> B[Re-upload PDF]
    B --> C[qpdf normalize]
    C --> D[Delete original + verify]
    D --> E[OpenDataLoader tags]
    E --> F[Delete normalized + verify]
    F --> G[qpdf check + veraPDF]
    G --> H[Re-audit + regression guard]
    H --> I[Output ready, 30 min TTL]
    I --> J[Single-use download]
    J --> K[Delete + verify ENOENT]

The user re-uploads the PDF. qpdf normalizes it; original deleted with verification. OpenDataLoader adds structure tags; normalized intermediate deleted with verification. qpdf check + veraPDF validate the output. A re-audit confirms no score profile regressed. If all clear, output is held for 30 minutes; user downloads via single-use token; output deleted with verification.
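The regression guard in the comparing stage can be sketched as a comparison of the three score profiles before and after remediation. The `ProfileScores` shape and function names are hypothetical, and the sketch assumes that "regress" means any strictly lower score.

```typescript
// Hypothetical shape for the three score profiles compared by the guard.
interface ProfileScores { overall: number; strict: number; practical: number }

// Lists every profile whose score dropped after remediation.
function regressedProfiles(before: ProfileScores, after: ProfileScores): (keyof ProfileScores)[] {
  const profiles: (keyof ProfileScores)[] = ["overall", "strict", "practical"];
  return profiles.filter((p) => after[p] < before[p]);
}

// The output is accepted only if no profile regressed.
function acceptRemediatedOutput(before: ProfileScores, after: ProfileScores): boolean {
  return regressedProfiles(before, after).length === 0;
}
```

Returning the list of regressed profiles, not just a boolean, would let a fallback message say which score dropped.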

Why Auditing Is Easy and Remediation Is Hard

Auditing a PDF is a read-only operation: walk the document's internal structure, ask "does it have a tagged StructTreeRoot? Are figures marked? Is the language declared?" and report what you find. The PDF specification (ISO 32000-2) is unambiguous about how to read these structures; the libraries that parse them (qpdf, pdfjs, veraPDF) are mature and battle-tested; the answers don't change between runs. A PDF can be audited a thousand times and produce the same result every time.

Remediation is a read-modify-write operation, and PDFs make that uniquely hard for several reasons that are baked into the format itself:

  1. PDF was designed for fixed-layout printing, not semantic content. Adobe published it in 1993 to make "documents that look identical on every printer." The accessibility layer (StructTreeRoot, marked content, role mapping) was bolted on in PDF 1.4 (2001) and is optional — valid PDFs can have none of it. Auto-tagging means reverse-engineering semantic meaning from raw visual presentation, which is much harder than reading existing semantic markers.
  2. There is no canonical mapping from visual layout to semantic role. Is a 14-pt bold line of text an <H2> or just emphasized body text? Is a 100×100-pixel image content (needs alt text) or decoration (mark as /Artifact)? A human reader judges from context; software guesses heuristically and is wrong some of the time.
  3. The content stream and the structure tree are coupled but separable. Every glyph and image in a PDF lives in a per-page content stream. In a tagged PDF, each one is wrapped in a "marked content" sequence (/P <</MCID 7>> BDC … EMC) that links it back to a node in the StructTreeRoot. Adding alt text to one image means mutating both sides coherently: write the new /Alt property on the Figure structure element AND ensure the MCID linkage stays valid. Many PDF libraries handle reading one side or the other, but not modifying both at once.
  4. The content layer can be in any of several representations. A scanned PDF has no text layer — it's just raster images, requiring OCR before any semantic remediation can happen. An optimized PDF compresses objects into "object streams" (a PDF 1.5+ feature) that some libraries can't safely round-trip. An encrypted PDF requires a password even to read. Each case is its own engineering minefield, and they layer onto each other (scanned-and-encrypted is worse than either alone).
  5. No single PDF library does everything well. pdf-lib (JavaScript, in the Node ecosystem) reads and writes metadata easily but has no StructTreeRoot builder. Apache PDFBox (Java) has the deepest structure-tree support but is Java-only. Ghostscript can rewrite PDFs but silently degrades tag structure. OpenDataLoader (Java, used here) is the only open-source tool that produces a tagged PDF from an untagged one, and even it cannot judge whether the result is meaningful.
  6. The "tagged PDF" specification is permissive. You can produce a PDF that satisfies all the technical requirements of being tagged (MarkInfo=true, StructTreeRoot exists, every page has marked content) and is still inaccessible to screen readers (e.g., every paragraph wrapped in a single <P> with no heading structure). PDF/UA-1 (ISO 14289-1) narrows this somewhat but doesn't eliminate it. Automated remediation tools often produce tagged-but-shallow output that machine validators accept but assistive technology can't navigate.
  7. Mistakes compound badly. A wrong heading level might confuse a screen reader user. A corrupted cross-reference (xref) table makes the entire PDF unreadable by any viewer. Remediation tools have to be conservative: when in doubt, don't touch. The qpdf preprocessing step in this pipeline exists precisely because OpenDataLoader's PDF writer occasionally corrupts the xref on round-trip with certain inputs (the InDesign 18.x / Word 365 case described in the qpdf Preprocessing section); we accept the cost of an extra normalization pass to avoid serving a damaged file.
  8. Round-trip fidelity is the highest bar. Remediation must add semantic markup while preserving every visual nuance: embedded fonts, raster + vector images, color spaces, ICC profiles, page labels, bookmarks, hyperlinks, form fields, digital signatures, embedded multimedia. The user doesn't want their report to look different after remediation; they want the same document with structure added. Read-modify-write while changing only the semantic layer is a class of problem the format simply wasn't designed to make easy.
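To make the content-stream side of point 3 concrete, here is a small sketch that lists the MCIDs found in an already-decoded content stream. It is illustrative only: the function name and sample stream are made up, it handles inline property dictionaries but not named property lists, and a real parser would tokenize the stream rather than use a regular expression.

```typescript
/** Collect the MCIDs of marked-content sequences in the order they appear. */
function extractMcids(contentStream: string): number[] {
  // A property dictionary containing "/MCID <n>", followed by the BDC operator.
  const pattern = /<<[^<>]*\/MCID\s+(\d+)[^<>]*>>\s*BDC/g;
  const ids: number[] = [];
  for (const match of contentStream.matchAll(pattern)) {
    ids.push(Number(match[1]));
  }
  return ids;
}

// Hypothetical page content: one paragraph and one figure.
const sampleStream = `
/P <</MCID 0>> BDC BT /F1 12 Tf (Hello) Tj ET EMC
/Figure <</MCID 1>> BDC /Im0 Do EMC
`;
console.log(extractMcids(sampleStream)); // [ 0, 1 ]
```

Each MCID returned here must also be referenced from a structure element in the StructTreeRoot; a remediation tool that rewrites one side without the other leaves the link dangling.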

The result is that PDF auto-remediation works well for the machine-checkable parts of accessibility (structure presence, metadata, language declaration, tagged content stream) and falls back to human judgment for the semantically-judged parts (alt-text quality, reading-order intent, decorative vs. informative classification). The roadmap for this tool (see docs/pdf-remediation-alt-text-walkthrough-spec.md) is an interactive walkthrough that augments the machine-checkable foundation with human-authored alt text — without any AI in the loop, because the regulatory durability of agency-authored content is higher than the durability of AI-generated content.

Why OpenDataLoader Changes the Cost Equation

Until 2024–2025, programmatically tagging a PDF (auto-generating StructTreeRoot, marking figures, tables, headings) was something only a handful of commercial vendors could do, and they priced accordingly. The economics of PDF accessibility have historically been brutal for state agencies: PDF/UA expertise is rare, specialized, and was locked behind commercial walls for decades.

Commercial PDF remediation, today:

  • Apryse / PDFTron SDK: enterprise-quoted, typically $1,500/yr minimum for the entry SDK and considerably more for the auto-tagging add-on. On-prem deployable but you pay for the privilege of running their Java/C++ binary in your own data center.
  • Adobe PDF Services API: Accessibility Auto-Tag endpoint, free tier of 500 transactions per month (about 50 pages — exhausted by a single annual report). Beyond the free tier: enterprise-quoted, scaling per-document. Your PDF leaves your network for the API call.
  • PDFix SDK, AbleDocs ADapi, CommonLook API: all enterprise-quoted, all opaque pricing, all aimed at large organizations.
  • Manual remediation services: $5–$50 per page for hand-remediation of tagged-and-reviewed output. A typical 50-page agency report costs $250–$2,500 to remediate this way, and that's per document. State agencies producing dozens of reports per year face annual remediation bills in the tens of thousands.

Why so expensive? The skill is rare — there are relatively few practitioners who can read a structure tree and judge whether it's correct. The labor is real — even with good tooling, a 50-page report can require 4–8 hours of expert work. The market is small, the demand is regulated (ADA Title II, IITAA, Section 508), and the buyers are mostly governments and large organizations that aren't price-sensitive. The result is a niche industry with high prices and slow innovation.

OpenDataLoader PDF, released as Apache 2.0 in 2024 and continuously developed since, is the first credible open-source PDF auto-tagger. It does what previously required a $1,500/year SDK subscription: takes an untagged PDF and produces a tagged one. It's developed by Hancom (a Korean office-software vendor with deep PDF expertise) in collaboration with the PDF Association and Dual Lab (the same people behind the veraPDF validator). It ranks #1 overall (0.907) in 2026 PDF-extraction accuracy benchmarks — not just "as good as the commercial tools," better than them on the published metrics.

For this tool, OpenDataLoader is load-bearing. The pipeline architecture (qpdf preprocess → ODL tag → veraPDF check → re-audit) takes the most expensive part of commercial PDF remediation — the auto-tagging step — and replaces it with an apt install openjdk-17-jre-headless. The other open-source tools we pair it with (qpdf for preprocessing, veraPDF for PDF/UA-1 conformance validation) are also free and mature. Together they form a complete pipeline that until very recently did not exist in open source.

What ODL doesn't do — and no auto-tagger does — is judge whether the resulting structure is meaningful. It can mark every image as a Figure but can't write an alt-text. It can mark every table cell but can't decide which row is the header. Those remain human judgment calls. The economic shift ODL enables is from "$1,500/year + per-document manual labor" to "$0 of software + the manual labor for the parts a machine genuinely cannot do." That's an order-of-magnitude cost reduction for the agencies it serves, with no loss of output quality.

Tool 3: OpenDataLoader PDF (Auto-Tagging)

OpenDataLoader PDF (ODL) is an Apache-2.0-licensed Java application that takes an untagged PDF and writes a Tagged PDF with a populated StructTreeRoot. It is the first open-source tool to offer this transformation; it ranks #1 overall (0.907) in 2026 PDF-extraction benchmarks across reading order, table extraction, and heading detection. ICJIA maintains a fork at ICJIA/opendataloader-pdf as a hedge against future license changes upstream.

  • Invocation: @opendataloader/pdf v2.4.3 npm wrapper around a bundled JAR (lib/opendataloader-pdf-cli.jar).
  • Runtime: OpenJDK 17 is the deploy target; Java 11 is the minimum (java -version ≥ 11). Install via apt install openjdk-17-jre-headless on Ubuntu 22.x.
  • JVM heap cap: JAVA_TOOL_OPTIONS=-Xmx768m set per-invocation by the worker as a safety rail against pathological documents.
  • Convert options used: { outputDir, format: 'tagged-pdf', quiet: true }. Hybrid mode (docling-fast + SmolVLM) is deliberately not used in v1; see the spike report for why.
  • Wall-clock timeout: REMEDIATION.WORKER_TIMEOUT_MS (5 min default); the JVM child is killed on overrun.
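The heap cap and wall-clock timeout above could be wired up roughly as follows. This is a sketch under assumptions: jvmEnv and runWithTimeout are hypothetical helper names, and the actual worker may invoke ODL through the npm wrapper rather than spawning a process directly.

```typescript
import { spawn } from "node:child_process";

type Env = Record<string, string | undefined>;

/** Copy of the base environment with a JVM heap cap, e.g. heapMb = 768. */
function jvmEnv(base: Env, heapMb: number): Env {
  return { ...base, JAVA_TOOL_OPTIONS: `-Xmx${heapMb}m` };
}

interface RunResult {
  exitCode: number | null; // null when the child was terminated by a signal
  timedOut: boolean;
}

/** Run a command and kill it if it exceeds timeoutMs. */
function runWithTimeout(cmd: string, args: string[], timeoutMs: number, env: Env): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { env, stdio: "ignore" });
    let timedOut = false;
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL"); // hard stop on overrun
    }, timeoutMs);
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err); // e.g. the binary is not installed
    });
    child.on("exit", (code) => {
      clearTimeout(timer);
      resolve({ exitCode: code, timedOut });
    });
  });
}

// Usage (arguments shown are placeholders, not the ODL command line):
// const result = await runWithTimeout("java", ["-version"], 5 * 60_000, jvmEnv(process.env, 768));
```

Setting the cap through the environment rather than a command-line flag means it applies even when the JVM is launched indirectly by a wrapper.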

Why a Java tool in a Node.js codebase: every other PDF/UA-targeted auto-tagger is either commercial (Apryse, Adobe PDF Services API), Java-only, or both. The tradeoff is one additional system dependency (JRE) on the deploy box in exchange for free, locally-hosted auto-tagging with no outbound API calls.

qpdf Preprocessing: --object-streams=disable

Stage 1 of the remediation pipeline runs the input through qpdf --object-streams=disable INPUT NORMALIZED before ODL ever touches it. This rewrites objects stored in PDF 1.5+ compressed object streams as traditional uncompressed objects. Without this preprocessing, ODL's Java PDF writer corrupts the output xref table on certain inputs, specifically tagged PDFs emitted by modern Adobe InDesign (18.x) and Microsoft Word 365.

This bug was discovered during the OpenDataLoader feasibility spike on the FY_22_ICJIA_Annual_Report (InDesign 18.2) and 2022 SFS Process Evaluation Report (Word 365) fixtures. Without preprocessing, ODL emits a PDF that qpdf --check reports as damaged: xref num N not found, Invalid object stream, Catalog object is wrong type (null). With preprocessing, both PDFs round-trip cleanly and the audit grade improves from F to D. See docs/spike-remediation-results.md for the full reproducer and results.

Output Validation: qpdf --check + veraPDF

Every remediated PDF passes through up to two independent validators before the worker is allowed to serve it. A qpdf --check failure rejects the output (job marked failed, file deleted) even though the upstream pipeline succeeded. The veraPDF verdict is recorded but, as described below, does not block delivery on its own.

  • qpdf --check <output>: parses the entire PDF structure and reports warnings on damaged xref tables, malformed object streams, broken catalogs, etc. The worker treats "operation succeeded with warnings" as a failure — better to discard a borderline file than serve a damaged one.
  • verapdf --flavour ua1 --format json <output>: runs the veraPDF open-source PDF/UA-1 conformance validator (from the PDF Association + Dual Lab). Configured via REMEDIATION_VERAPDF_PATH; optional — when not configured, the receipt records verapdf_unavailable and skips this step. veraPDF's verdict is informational, not blocking: even a PDF that veraPDF flags as non-conformant is still served if the audit score didn't regress. The result page surfaces this honestly in the IITAA compliance disclaimer.
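The blocking and non-blocking roles of the two validators can be sketched as a pair of small functions. The names are hypothetical. The exit-code mapping follows qpdf's documented convention (0 for a clean file, 2 for errors, 3 for warnings without errors); treating any other code as an error is an assumption of this sketch.

```typescript
type QpdfVerdict = "clean" | "warnings" | "errors";
type VeraPdfVerdict = "passed" | "failed" | "unavailable";

/** Map a qpdf --check exit code to a verdict. */
function classifyQpdfExit(code: number): QpdfVerdict {
  if (code === 0) return "clean";
  if (code === 3) return "warnings";
  return "errors"; // 2 = errors; unknown codes are treated as errors too
}

/**
 * Decide whether validation lets the output through.
 * qpdf is blocking, and warnings count as failure; veraPDF is recorded
 * elsewhere for the receipt but never blocks on its own.
 */
function passesValidation(qpdf: QpdfVerdict, _veraPdf: VeraPdfVerdict): boolean {
  return qpdf === "clean";
}

console.log(passesValidation(classifyQpdfExit(3), "passed")); // false
console.log(passesValidation(classifyQpdfExit(0), "failed")); // true
```

Passing this gate is necessary but not sufficient: the regression guard described next still has to agree before the file is served.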

Regression Guards

After successful tagging + validation, the worker re-audits the output and compares against the pre-flight audit stored at job creation time. Three independent comparisons run:

if (output.overallScore < input.overallScore ||
    output.scoreProfiles.strict.overallScore < input.scoreProfiles.strict.overallScore ||
    output.scoreProfiles.remediation.overallScore < input.scoreProfiles.remediation.overallScore) {
  recordEvent(jobId, 'validation_failed', { regressed_profiles: [...] })
  await deleteAndVerify(jobId, taggedPath, 'cleanup')
  setFailed(jobId, `auto-remediation regressed: ${regressed.join(', ')}`)
  return
}

Why all three: the headline overall score uses whichever profile is the active scoring mode, which can mask a regression on the other profile. Checking both profiles plus the displayed overall ensures the user never sees a metric that decreased. The validation_failed event payload records all six numbers (input/output × overall + strict + practical) plus the regressed_profiles array, so any auditor query can identify exactly which profile failed and by how much.
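One way to compute the list of regressed profiles is sketched below. It is not the worker's actual code: the AuditScores shape is reduced to the three fields the guard compares, and reporting the scoreProfiles.remediation key under the label "practical" is an assumption based on the surrounding prose.

```typescript
interface AuditScores {
  overallScore: number;
  scoreProfiles: {
    strict: { overallScore: number };
    remediation: { overallScore: number }; // assumed to be the "Practical" profile
  };
}

type Profile = "overall" | "strict" | "practical";

/** Return every score that is lower in the output than in the input. */
function findRegressions(input: AuditScores, output: AuditScores): Profile[] {
  const regressed: Profile[] = [];
  if (output.overallScore < input.overallScore) regressed.push("overall");
  if (output.scoreProfiles.strict.overallScore < input.scoreProfiles.strict.overallScore) {
    regressed.push("strict");
  }
  if (output.scoreProfiles.remediation.overallScore < input.scoreProfiles.remediation.overallScore) {
    regressed.push("practical");
  }
  return regressed;
}

// Made-up scores: the headline improves while the strict profile slips.
const before: AuditScores = { overallScore: 42, scoreProfiles: { strict: { overallScore: 30 }, remediation: { overallScore: 42 } } };
const after: AuditScores = { overallScore: 61, scoreProfiles: { strict: { overallScore: 28 }, remediation: { overallScore: 61 } } };
console.log(findRegressions(before, after)); // [ 'strict' ]
```

The example shows the masking case the paragraph above describes: a guard that looked only at the headline score would have served this file.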

Lifecycle Audit Trail: remediation_events

Every remediation produces an append-only series of timestamped lifecycle events in the remediation_events SQLite table (apps/api/data/audit.db). The table is the canonical source for the receipt displayed on the result page, the auditor evidence trail, and any future compliance reporting. PDF content is never stored — only structural metadata.

CREATE TABLE remediation_events (
  id          INTEGER PRIMARY KEY AUTOINCREMENT,
  job_id      TEXT NOT NULL,
  event       TEXT NOT NULL,
  occurred_at INTEGER NOT NULL,
  details     TEXT,  -- JSON, content-free metadata only
  FOREIGN KEY (job_id) REFERENCES remediation_jobs(id)
);

Event vocabulary (closed set, typed at compile time):

  • received
  • processing_started
  • normalize_complete
  • input_deleted
  • tagging_complete
  • intermediate_deleted
  • validation_passed
  • validation_failed
  • verapdf_passed
  • verapdf_failed
  • verapdf_unavailable
  • output_ready
  • downloaded
  • output_deleted
  • verified_absent
  • verify_failed
  • expired
  • error
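A closed, compile-time-typed vocabulary like this is commonly expressed in TypeScript as a const tuple with a derived union type. The sketch below uses the event names listed above; the identifier names and the runtime guard are illustrative, not taken from the codebase.

```typescript
const REMEDIATION_EVENTS = [
  "received", "processing_started", "normalize_complete", "input_deleted",
  "tagging_complete", "intermediate_deleted", "validation_passed", "validation_failed",
  "verapdf_passed", "verapdf_failed", "verapdf_unavailable", "output_ready",
  "downloaded", "output_deleted", "verified_absent", "verify_failed",
  "expired", "error",
] as const;

/** Union of the literal event names; anything else fails to type-check. */
type RemediationEvent = (typeof REMEDIATION_EVENTS)[number];

/** Runtime guard for values read back from the database. */
function isRemediationEvent(value: string): value is RemediationEvent {
  return (REMEDIATION_EVENTS as readonly string[]).includes(value);
}

const sampleEvent: RemediationEvent = "verified_absent"; // compiles
// const typo: RemediationEvent = "verified_absnt";      // would be a compile-time error
console.log(isRemediationEvent("downloaded"), isRemediationEvent("uploaded")); // true false
```

The guard matters because the event column is plain TEXT in SQLite, so rows read back from disk are not covered by the compile-time check.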

The verified_absent event is the critical compliance signal. It is emitted after the worker (or cleanup sweep, or download handler) calls fs.unlink on a job artifact and a follow-up fs.stat fails with ENOENT. The details payload contains a SHA-256 hash of the deleted path string (not file content) so auditors can reconcile event entries against expected paths without storing the paths themselves in the log.
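The delete-then-verify step might look something like the following. This is a simplified sketch with an assumed signature: the document's own deleteAndVerify calls take additional arguments such as a job id and a reason, and the real helper presumably writes the event row itself rather than returning it.

```typescript
import { createHash } from "node:crypto";
import { stat, unlink } from "node:fs/promises";

/** SHA-256 of the path string (not the file content), hex-encoded. */
function hashPath(path: string): string {
  return createHash("sha256").update(path, "utf8").digest("hex");
}

interface DeleteOutcome {
  event: "verified_absent" | "verify_failed";
  pathHash: string;
}

/** Delete a file, then prove it is gone by expecting stat to fail with ENOENT. */
async function deleteAndVerify(path: string): Promise<DeleteOutcome> {
  const pathHash = hashPath(path);
  // Ignore unlink errors here; the stat below is what decides the outcome.
  await unlink(path).catch(() => undefined);
  try {
    await stat(path);
    return { event: "verify_failed", pathHash }; // the file is still there
  } catch (err) {
    const code = (err as { code?: string }).code;
    return { event: code === "ENOENT" ? "verified_absent" : "verify_failed", pathHash };
  }
}
```

Only an explicit ENOENT counts as proof of absence; a permission error from stat is reported as verify_failed rather than assumed to be a successful deletion.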

Privacy & Retention (Remediation-Specific)

The remediation pipeline maintains the same posture as the audit pipeline (no PDF content persisted) with three additional rules:

  1. No between-stage cache. The just-audited PDF is not cached on disk waiting for the user to click Remediate. Clicking Remediate prompts a re-upload. UX cost: one extra upload. Privacy cost of caching: declined.
  2. Inputs deleted between pipeline stages. The worker writes work/input.pdf, normalizes it to work/normalized.pdf, then deleteAndVerify(work/input.pdf). Once ODL produces work/odl/<name>_tagged.pdf, the normalized intermediate is deleted. At any moment, at most one copy of the PDF exists on disk per job. The entire scratch dir is wiped in a finally block regardless of pipeline outcome.
  3. Output deleted on first download. GET /api/remediate/:id/download streams via createReadStream + pipe (no memory buffering); the response 'close' handler triggers deleteAndVerify(outputPath, 'download'). The job row is marked status='expired' before the stream begins, so a concurrent second download request sees 410 Gone. Files not downloaded within REMEDIATION.OUTPUT_TTL_MS (30 min default) are deleted by the cleanup sweep.
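The single-use rule in item 3 comes down to flipping the job's status before any bytes are streamed. A minimal in-memory sketch is below; the Job shape, the "ready" status name, and the function name are assumptions, and the real implementation would make the status change atomically in the database.

```typescript
type JobStatus = "ready" | "expired";

interface Job {
  id: string;
  status: JobStatus;
  readyAt: number; // epoch milliseconds when the output became available
}

const OUTPUT_TTL_MS = 30 * 60 * 1000; // 30-minute default from the text above

/** Returns the HTTP status a download request should receive. */
function claimDownload(job: Job, now: number): 200 | 410 {
  const stale = now - job.readyAt > OUTPUT_TTL_MS;
  if (job.status !== "ready" || stale) {
    job.status = "expired";
    return 410; // Gone: already downloaded or past the TTL
  }
  job.status = "expired"; // flip BEFORE streaming so a second request loses the race
  return 200;
}

const demoJob: Job = { id: "demo", status: "ready", readyAt: 0 };
console.log(claimDownload(demoJob, 1_000)); // 200
console.log(claimDownload(demoJob, 2_000)); // 410
```

Marking the row first trades a small risk (a failed stream cannot be retried) for the guarantee that no file is ever served twice.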

Filesystem permissions are 0700 on apps/api/data/remediation/ and 0600 on output files. Output filenames are <jobId>.pdf where jobId is a UUIDv4 (122 bits of entropy) — not derivable from the user's input. The remediation_events rows are retained per REMEDIATION.EVENT_LOG_RETENTION_DAYS (7 years default — typical state-agency records-retention schedule); the remediation_jobs row is purged separately at REMEDIATION.JOB_ROW_RETENTION_DAYS (30 days default).

Deploy Topology (Ubuntu 22.04 + PM2 + Nginx + DigitalOcean)

The API spawns the worker via spawn(process.execPath, ['--import', 'tsx', WORKER_PATH, jobId], { detached: true, stdio: 'ignore' }).unref(). PM2 does not manage the worker; it is a transient, detached child of the API process that exits on its own when the pipeline completes or crashes. Worker stdout is suppressed; all status reporting flows through the database (remediation_jobs.status, progress_pct, step), which the frontend polls via GET /api/remediate/:id/status every 2 seconds.

System packages required on the deploy box: qpdf ≥ 10.x, openjdk-17-jre-headless, and (optional) the veraPDF CLI from verapdf.org. The rebuild.sh preflight verifies all three on every deploy and emits warnings if any are missing or below required version. The feature flag REMEDIATION_ENABLED is forwarded from the parent shell through ecosystem.config.cjs's env block, so the deploy idiom is:

sudo apt install -y openjdk-17-jre-headless   # one-time
echo 'REMEDIATION_ENABLED=true' | sudo tee -a /etc/environment
set -a; source /etc/environment; set +a       # export the values so child processes inherit them
./rebuild.sh                                  # pulls, builds, pm2 restart

# Rollback to audit-only without redeploying:
sudo sed -i '/^REMEDIATION_ENABLED=/d' /etc/environment
unset REMEDIATION_ENABLED                     # clear it from the current shell too
pm2 restart ecosystem.config.cjs

Limitations & What This Tool Cannot Do

This tool provides a thorough automated assessment, but no automated tool can fully replace manual accessibility testing. Important limitations:

  1. Alt text quality: The tool detects whether alt text exists and runs a heuristic check for obviously poor alt text (hex-encoded strings, filenames like "IMG_001.jpg", generic placeholders like "image", and long strings without spaces). However, it cannot evaluate whether alt text is semantically meaningful. For example, "a chart" technically passes all automated checks, but "Bar chart showing 2024 crime rates by county" is far more useful. Human review is still needed to assess alt text quality beyond the heuristic flags.
  2. Color contrast: PDF color contrast analysis requires rendering each page as an image and analyzing pixel colors. This tool focuses on structural accessibility (tags, metadata, markup) and does not currently assess color contrast.
  3. Natural language clarity: The tool cannot evaluate whether the text itself is written clearly. WCAG Success Criterion 3.1.5 (Reading Level, a Level AAA criterion) asks that content be readable at a lower secondary education level or be accompanied by a simpler version; judging this requires a human.
  4. Decorative images: Not all images need alt text; decorative images should be marked as artifacts. The tool cannot distinguish informative images from decorative ones, so it reports all images without alt text as a potential issue.
  5. Complex layouts: While reading order is assessed via MCID sequence analysis, extremely complex layouts (e.g., multi-column magazine spreads, nested pull quotes) may have subtle ordering issues that the 20% disorder threshold doesn't catch.
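The alt text heuristic from item 1 can be approximated in a few lines. This sketch is not the tool's actual rule set: the flag names, the placeholder word list, the file-extension list, and the 40-character threshold for "long strings without spaces" are all assumptions chosen for illustration.

```typescript
// Hypothetical placeholder list; the tool's real list is not shown in this document.
const GENERIC_ALT = new Set(["image", "picture", "photo", "graphic", "figure"]);

/** Return the heuristic flags raised by an alt text string (empty array = none). */
function altTextFlags(alt: string): string[] {
  const text = alt.trim();
  const flags: string[] = [];
  if (text.length === 0) flags.push("empty");
  // Four or more hex byte pairs and nothing else, e.g. "FEFF00480065".
  if (/^(?:[0-9a-f]{2}){4,}$/i.test(text)) flags.push("hex_encoded");
  // Ends in a common image file extension, e.g. "IMG_001.jpg".
  if (/\.(?:jpe?g|png|gif|tiff?|bmp|svg)$/i.test(text)) flags.push("filename");
  if (GENERIC_ALT.has(text.toLowerCase())) flags.push("generic_placeholder");
  // Long run with no whitespace; the 40-character cutoff is an assumed threshold.
  if (text.length > 40 && !/\s/.test(text)) flags.push("no_spaces");
  return flags;
}

console.log(altTextFlags("IMG_001.jpg")); // [ 'filename' ]
console.log(altTextFlags("a chart"));     // []
```

The second call illustrates the limitation stated above: "a chart" raises no flags even though it tells a screen reader user almost nothing.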

For a complete accessibility evaluation, this tool's automated analysis should be supplemented with manual testing using an actual screen reader (e.g., NVDA, JAWS, or VoiceOver) and the Adobe Acrobat Accessibility Checker.

These limitations apply to auto-remediation too. When the optional auto-remediation feature runs, OpenDataLoader can add a /Figure structure element for an image — but it cannot author a meaningful description. The same human-judgment gap applies to color contrast, reading-order ambiguity in multi-column layouts, distinguishing decorative from informative images, and writing text at a clear reading level. Auto-remediation is genuinely helpful for the machine-checkable parts of accessibility (structure, metadata, language declaration); it is not a substitute for the human-judgment parts. The result page is explicit about this in the IITAA compliance disclaimer.

Privacy & Security

The application is hosted on DigitalOcean cloud infrastructure (managed via Laravel Forge). When you upload a PDF:

  1. The file is written to a temporary directory on the server, analyzed by QPDF and PDF.js, and immediately deleted. No PDF content is retained after analysis completes.
  2. The file exists on the server only for the duration of analysis (typically under 10 seconds).
  3. No PDF data is transmitted to external APIs, cloud services, or AI models. All analysis runs on the server itself.
  4. Encrypted (password-protected) PDFs are rejected with a clear error before analysis begins.
  5. A concurrency semaphore limits the server to two simultaneous analyses to prevent resource exhaustion.
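The concurrency semaphore in item 5 is a standard pattern. A minimal sketch of one is shown below, assuming a single Node.js process; the class and its method names are illustrative and are not taken from the codebase.

```typescript
/** Counting semaphore that caps how many tasks run at the same time. */
class Semaphore {
  private active = 0;
  private readonly limit: number;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  /** Take a slot if one is free; never waits. */
  tryAcquire(): boolean {
    if (this.active >= this.limit) return false;
    this.active += 1;
    return true;
  }

  /** Take a slot, waiting in FIFO order if none is free. */
  async acquire(): Promise<void> {
    if (this.tryAcquire()) return;
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  /** Give a slot back, handing it straight to the next waiter if there is one. */
  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // the slot changes owner, so the active count is unchanged
    else if (this.active > 0) this.active -= 1;
  }

  /** Run a task inside a slot, releasing it even if the task throws. */
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

// Two simultaneous analyses, matching the limit described above.
const analysisSlots = new Semaphore(2);
```

A request handler would wrap each analysis in analysisSlots.run(...), so a third upload waits for a free slot instead of starting a third qpdf process.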

Shared reports: When you click "Share Report," the analysis results only — scores, category findings, grade, metadata (title, author, page count) — are saved to a SQLite database file on the same DigitalOcean droplet. Specifically:

  • The original PDF file is never saved — only the structured audit results (JSON) are stored.
  • Shared links expire after 365 days. After expiration, the stored results are eligible for permanent deletion. The 365-day window is sized for the auditor / fleet inventory use case — fleet reports run on a multi-month cadence and reviewers need report links to stay valid for at least a year. Older results are deleted by the periodic cleanup sweep.
  • Anyone with the link can view the report without logging in. No account is required to view a shared report.
  • The database is stored locally on the server filesystem — it is not replicated to external storage or backup services.

When auto-remediation is enabled (the optional v1.18.0 feature behind REMEDIATION_ENABLED=true), the file lifecycle differs from a plain audit. The remediation worker needs the PDF on disk briefly to run external tools (qpdf, OpenDataLoader, veraPDF). The posture remains "as short-lived as the work requires, then deleted with verification":

  • No between-stage cache. A PDF is never stored on disk waiting for the user to click "Remediate" after an audit. Clicking the button prompts a fresh multipart upload — the just-audited buffer is not preserved server-side.
  • Inputs deleted between pipeline stages. After qpdf normalizes the uploaded file, the original input is deleted. After OpenDataLoader produces the tagged output, the normalized intermediate is deleted. At any moment, at most one copy of the PDF exists on disk per job. The entire scratch directory is wiped in a finally block regardless of pipeline outcome (including crashes).
  • Output deleted on first download. The remediated PDF is served via a single-use download token. The file is deleted as soon as the response stream closes, and an fs.stat call verifies the deletion succeeded (the verified_absent event in the audit log is the auditor evidence). Concurrent or repeat download attempts return 410 Gone.
  • Maximum 30-minute output retention. If the user never downloads, a cleanup sweep removes the file after REMEDIATION.OUTPUT_TTL_MS (default 30 minutes) and marks the job status='expired'.
  • Lifecycle events contain no PDF content. Each step (received, normalize_complete, tagging_complete, validation_passed, output_ready, downloaded, output_deleted, verified_absent, etc.) writes a row to remediation_events with a server-side timestamp and a JSON payload of structural metadata only. File paths are recorded as SHA-256 hashes rather than literal strings.
  • No external API calls. The remediation pipeline runs entirely on this server. OpenDataLoader and veraPDF execute locally; the file never leaves the droplet. AI-based alt text generation (which would call a hosted vision API) is explicitly not used in v1 — see the docs/pdf-remediation-alt-text-walkthrough-spec.md roadmap document for the AI-free Phase 1 approach.
  • Per-user concurrency limit. Each user can have at most one remediation job in flight at a time (REMEDIATION.MAX_CONCURRENT_JOBS_PER_USER). The 50 MB file-size cap, 500-page count cap, 5-minute wall-clock timeout, and 768 MB JVM heap cap are additional resource-exhaustion guards.

Verify for yourself: The complete source code for the analysis and auto-remediation pipelines is open source.

What This Tool Does

Audit any PDF for WCAG 2.1 AA accessibility — and (optionally) auto-remediate it, all on infrastructure you control, with no AI and no per-document fees.

9
WCAG categories audited

Each PDF scored across 9 categories aligned with WCAG 2.1 Level AA and ADA Title II. A–F letter grade plus Critical / Serious / Moderate severity per category so you know what to fix first.

F → A
Auto-remediation (optional)

Tag untagged PDFs in seconds with the qpdf → OpenDataLoader → veraPDF pipeline. Output never regresses any score profile, and manual review is still recommended for IITAA compliance.

PDF/UA-1
Standards aligned

WCAG 2.1 Level AA, ADA Title II (effective April 2026), Illinois IITAA, and PDF/UA-1 (ISO 14289-1) via veraPDF. Full lifecycle audit trail with deletion verification for compliance reporting.

0
PDFs retained

Uploaded files exist on the server only as long as the pipeline requires. Audited files: deleted as soon as analysis completes, typically within seconds. Remediated outputs: deleted on first download or 30-minute TTL, then fs.stat-verified absent.

$0
No AI, no third-party APIs

Every step runs on this server. No data is sent to vision models, hosted AI services, or commercial PDF SDKs. The toolchain (qpdf, pdfjs, OpenDataLoader, veraPDF) is entirely open source — no per-document fees, no SDK licensing.

100%
Open source

Every line of code is on GitHub — fork it, audit it, run it on your own infrastructure. Underlying tools use Apache 2.0 / MIT / MPL licenses. Designed for state agencies that need control over their accessibility pipeline.