Accessibility Audit

Technical Details

How This Tool Works

A diagram-led tour of the audit and auto-remediation pipelines. For the full prose treatment with code references, see the Technical Details expandable section on the audit page. For the legal/compliance-facing version, see the data retention policy.

1. What this tool does

The tool answers two questions about a PDF:

  • Audit: "How accessible is this PDF, and what specifically is wrong?" Scored across 9 WCAG-aligned categories, with an A through F grade and category-level findings.
  • Auto-remediate (optional, opt-in): "Can we add accessibility structure to this PDF without making it worse?" Runs an automated tagging pipeline, validates the output, and serves the improved file only if every score profile holds or improves.
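
The "holds or improves" rule can be read as a comparison between the scores before and after remediation. The sketch below is one possible reading, not the tool's actual code: it assumes each score profile reduces to a single 0-100 number, and the type ProfileScores, the function holdsOrImproves, and the profile names in the usage note are all hypothetical.

```typescript
// Hypothetical shape: one overall score (0-100) per named score profile.
type ProfileScores = Record<string, number>;

// Returns true only when every profile in `before` is present in `after`
// with a score that is equal or higher.
function holdsOrImproves(before: ProfileScores, after: ProfileScores): boolean {
  return Object.entries(before).every(([profile, score]) => {
    const next = after[profile];
    // A profile missing from the re-audit is treated as a regression.
    return next !== undefined && next >= score;
  });
}
```

Under these assumptions, holdsOrImproves({ strict: 40, practical: 55 }, { strict: 40, practical: 80 }) is true, while a single point lost in any one profile makes it false, so the remediated file would not be served.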

Both happen on a single DigitalOcean server controlled by ICJIA. Nothing leaves the server. No AI service is contacted at any point.

2. The audit pipeline

The audit holds the uploaded PDF in memory only — never written to disk. Two open-source tools (qpdf for structure, pdfjs for content) run in parallel against the in-memory buffer; their combined output feeds the scorer.

Audit pipeline
flowchart TD
    A[Browser uploads PDF] --> B[Magic-byte + size check]
    B --> C[Hold in memory only]
    C --> D[qpdf analyzes structure]
    C --> E[pdfjs extracts content]
    D --> F[Combined results]
    E --> F
    F --> G[Scorer applies 9 categories]
    G --> H[A-F grade + findings]
    H --> I[Return to browser]

Browser uploads a PDF; the server validates magic bytes and size, holds the file in memory, runs qpdf and pdfjs in parallel against it, combines the results in the scorer, produces a grade plus category findings, returns to the browser, and discards the memory buffer.
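
The first server-side step, the magic-byte and size check, might look roughly like the sketch below. It is illustrative only: the size cap is an assumed value (the real limit is not stated here), the names checkUpload and MAX_UPLOAD_BYTES are hypothetical, and the check is deliberately strict about the "%PDF-" signature sitting at byte 0.

```typescript
// Assumed limit for illustration only; the tool's real size cap is not stated here.
const MAX_UPLOAD_BYTES = 25 * 1024 * 1024;

// "%PDF-" in ASCII, the signature at the start of a PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

type UploadCheck = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: rejects empty, oversized, or non-PDF uploads
// before any parser touches the bytes.
function checkUpload(bytes: Uint8Array, maxBytes: number = MAX_UPLOAD_BYTES): UploadCheck {
  if (bytes.length === 0) return { ok: false, reason: "empty upload" };
  if (bytes.length > maxBytes) return { ok: false, reason: "file too large" };
  const hasMagic = PDF_MAGIC.every((byte, i) => bytes[i] === byte);
  return hasMagic ? { ok: true } : { ok: false, reason: "not a PDF (missing %PDF- header)" };
}
```

Checking the signature rather than the file extension or the declared MIME type means a renamed ZIP or image is turned away before qpdf or pdfjs runs.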

Why two tools? Each does a different job well. qpdf parses the PDF's internal object graph and structure tree, the parts a screen reader depends on. pdfjs (Mozilla's PDF rendering library) extracts visible text and content order. Running both gives the scorer a richer signal than either alone.

3. Why two tools?

Each tool sees the PDF differently. Running them in parallel lets the scorer reconcile a structural view (qpdf) with a content view (pdfjs) — useful for catching cases where a PDF claims structure it doesn't actually have, or vice versa.

Two-tool parallel analysis
flowchart TD
    A[Uploaded PDF buffer] --> B[Run in parallel]
    B --> C[qpdf: structure tree, language, outlines, images, tables]
    B --> D[pdfjs: text, metadata, content order, per-page details]
    C --> E[QpdfResult]
    D --> F[PdfjsResult]
    E --> G[Scorer]
    F --> G
    G --> H[Weighted score across 9 categories]

The uploaded buffer runs through qpdf (structure tree, language, outlines, images, tables) and pdfjs (text, metadata, content order) in parallel. Their results combine in the scorer for a weighted score across 9 categories.
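
The parallel run and the weighted score can be sketched as follows. Everything here is a stand-in under stated assumptions: the two analyzer functions return canned data instead of calling qpdf or pdfjs, the result fields are a small hypothetical subset, only three illustrative categories are shown rather than the tool's nine, and the weights and grade cutoffs are invented for the example.

```typescript
// Hypothetical result shapes; the real QpdfResult and PdfjsResult carry more fields.
interface QpdfResult { hasStructureTree: boolean; language: string | null; }
interface PdfjsResult { title: string | null; textLength: number; }
interface CategoryScore { category: string; score: number; weight: number; }
type Grade = "A" | "B" | "C" | "D" | "F";

// Stand-ins for the real analyzers so the sketch runs on its own.
async function analyzeWithQpdf(_pdf: Uint8Array): Promise<QpdfResult> {
  return { hasStructureTree: false, language: "en-US" };
}
async function analyzeWithPdfjs(_pdf: Uint8Array): Promise<PdfjsResult> {
  return { title: null, textLength: 1200 };
}

// Illustrative categories and weights only (three shown, not the tool's nine).
function scoreCategories(q: QpdfResult, p: PdfjsResult): CategoryScore[] {
  return [
    { category: "structure", score: q.hasStructureTree ? 100 : 0, weight: 3 },
    { category: "language", score: q.language ? 100 : 0, weight: 1 },
    { category: "title", score: p.title ? 100 : 0, weight: 1 },
  ];
}

// Weighted average of category scores on a 0-100 scale.
function weightedScore(scores: CategoryScore[]): number {
  const totalWeight = scores.reduce((sum, s) => sum + s.weight, 0);
  if (totalWeight === 0) return 0;
  return scores.reduce((sum, s) => sum + s.score * s.weight, 0) / totalWeight;
}

// Assumed cutoffs; the tool's real grade boundaries are not stated here.
function toGrade(score: number): Grade {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}

async function audit(pdf: Uint8Array) {
  // Both analyzers start immediately; Promise.all waits for the slower one.
  const [q, p] = await Promise.all([analyzeWithQpdf(pdf), analyzeWithPdfjs(pdf)]);
  const categories = scoreCategories(q, p);
  const score = weightedScore(categories);
  return { score, grade: toGrade(score), categories };
}
```

Keeping the analyzers independent of each other is what makes Promise.all safe here: neither result feeds the other, and reconciliation happens only in the scoring step.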

4. What is a PDF, really?

A PDF is an export format, not a source format. Adobe created PDF in 1993 to solve "make a document look identical on every printer." You don't write in a PDF — you write in Word, InDesign, Pages, or Google Docs, and you export to PDF when you want to share the finished result.

That export step is where accessibility is won or lost. PDF natively stores where every glyph appears on the page, not what those glyphs mean. The semantic layer that makes a PDF accessible (the structure tree, standardized as Tagged PDF in PDF 1.4 in 2001) is optional. It only gets created if the export tool explicitly emits it (Word's "Document structure tags for accessibility," InDesign's "Create Tagged PDF," etc.).

Without tags, a PDF reads like raw glyph positions to a screen reader — incoherent. With tags, it reads like a navigable document with headings, paragraphs, lists, tables, and image descriptions.

5. Why remediation is fundamentally limited

Auditing is read-only — you walk the PDF's internal structure and report what's there. Remediation is read-modify-write — and PDFs make that genuinely hard:

  • PDF was designed for print, not semantics. The structure tree was bolted on in 2001 and is optional.
  • No canonical mapping from layout to semantics. Is a 14-pt bold line a heading or just emphasis? Software has to guess.
  • Content stream and structure tree are coupled. Adding alt-text means mutating both sides coherently — most libraries handle one but not both.
  • Tagged PDF spec is permissive. A PDF can satisfy the technical requirements for "being tagged" and still be inaccessible (e.g., every paragraph wrapped in <P> with no heading structure).
  • Mistakes compound. A wrong heading misleads a screen reader; a corrupted cross-reference makes the entire PDF unreadable.
  • Round-trip fidelity is the highest bar. Remediation must add semantic markup while preserving every visual nuance.
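
To make the "software has to guess" point concrete, here is a deliberately naive heading heuristic. It is a hypothetical illustration of the ambiguity, not a description of how OpenDataLoader or this tool classifies text; the TextLine shape, the 1.15 size ratio, and the 80-character limit are all invented for the example.

```typescript
// Hypothetical description of one extracted line of text.
interface TextLine { text: string; fontSizePt: number; bold: boolean; }

// Naive guess: a short, bold line noticeably larger than body text,
// with no trailing sentence punctuation, is treated as a heading.
function guessIsHeading(line: TextLine, bodyFontSizePt: number): boolean {
  const text = line.text.trim();
  const larger = line.fontSizePt >= bodyFontSizePt * 1.15;
  const short = text.length > 0 && text.length <= 80;
  const endsLikeSentence = /[.!?:;,]$/.test(text);
  return line.bold && larger && short && !endsLikeSentence;
}
```

With 11-pt body text, this returns true for a 14-pt bold "Results" and equally true for a 14-pt bold "Warning: do not unplug the device", which is a callout rather than a heading. Layout alone gives the heuristic no way to tell them apart.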

Result: PDF auto-remediation works well for the machine-checkable parts of accessibility (structure presence, metadata, language declaration). It falls back to human judgment for the semantically-judged parts (alt-text quality, reading-order intent, decorative vs. informative classification).

6. The remediation pipeline

When the optional auto-remediation feature is enabled, clicking Remediate triggers a multi-stage pipeline. Every intermediate file is deleted before the next stage starts, and every deletion is verified by an fs.stat ENOENT check.
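
A delete-then-verify step of the kind described above could be sketched like this. The helper name deleteAndVerify is hypothetical, and using rm with force (so a file that is already gone is not an error) is an assumption about the desired behavior; the verification itself follows the text, with deletion confirmed only when stat fails with ENOENT.

```typescript
import { rm, stat } from "node:fs/promises";

// Hypothetical helper: removes a file, then proves it is gone.
async function deleteAndVerify(path: string): Promise<void> {
  // force: true makes removal of an already-missing file a no-op.
  await rm(path, { force: true });
  try {
    await stat(path);
  } catch (err) {
    // ENOENT is the only outcome that confirms the deletion.
    if ((err as { code?: string }).code === "ENOENT") return;
    throw err; // any other error means deletion could not be confirmed
  }
  throw new Error(`file still exists after delete: ${path}`);
}
```

Treating every non-ENOENT outcome as a failure is the conservative choice: a permissions error on stat, for example, leaves the file's fate unknown, so the pipeline should stop rather than assume success.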

Remediation pipeline
flowchart TD
    A[User clicks Remediate] --> B[Re-upload PDF]
    B --> C[qpdf normalize]
    C --> D[Delete original + verify]
    D --> E[OpenDataLoader tags]
    E --> F[Delete normalized + verify]
    F --> G[qpdf check + veraPDF]
    G --> H[Re-audit + regression guard]
    H --> I[Output ready, 30 min TTL]
    I --> J[Single-use download]
    J --> K[Delete + verify ENOENT]

The user re-uploads the PDF. qpdf normalizes it, and the original is deleted with verification. OpenDataLoader adds structure tags, and the normalized intermediate is deleted with verification. qpdf check and veraPDF validate the output. A re-audit confirms that no score profile regressed. If all checks pass, the output is held for 30 minutes, the user downloads it via a single-use token, and the output is deleted with verification.
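
The 30-minute, single-use download could be modeled as below. This is a sketch under assumptions: it keeps tokens in an in-memory Map, whereas the real tool may store them elsewhere (for example in its SQLite database), and the class DownloadTokens and its method names are hypothetical.

```typescript
import { randomBytes } from "node:crypto";

const OUTPUT_TTL_MS = 30 * 60 * 1000; // 30 minutes, matching the TTL in the pipeline

interface PendingDownload { filePath: string; expiresAt: number; }

// Hypothetical in-memory token store; `now` is injectable to make expiry testable.
class DownloadTokens {
  private readonly pending = new Map<string, PendingDownload>();

  issue(filePath: string, now: number = Date.now()): string {
    const token = randomBytes(32).toString("hex"); // unguessable 64-char token
    this.pending.set(token, { filePath, expiresAt: now + OUTPUT_TTL_MS });
    return token;
  }

  // Returns the file path at most once. The token is consumed on first use,
  // whether or not it was still within its TTL.
  redeem(token: string, now: number = Date.now()): string | null {
    const entry = this.pending.get(token);
    if (!entry) return null;
    this.pending.delete(token);
    return now < entry.expiresAt ? entry.filePath : null;
  }
}
```

A caller would still need to delete the output file in both the redeemed and the expired case; this sketch covers only the token lifecycle.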

Why qpdf normalize first? OpenDataLoader's PDF writer corrupts the output xref table on certain modern Adobe InDesign 18 and Microsoft Word 365 inputs. The qpdf --object-streams=disable step rewrites the file with its objects stored outside object streams before OpenDataLoader (ODL) touches it, which works around the bug entirely. This was discovered during the v1.18.0 feasibility spike; see docs/spike-remediation-results.md.
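
Invoking the normalization step from Node might look like the sketch below. Only the --object-streams=disable flag comes from the text; the function names are hypothetical, the qpdf binary is assumed to be on the PATH, and any additional flags the real pipeline passes are not shown.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Builds the argument list: qpdf [options] infile outfile.
function qpdfNormalizeArgs(inputPath: string, outputPath: string): string[] {
  return ["--object-streams=disable", inputPath, outputPath];
}

// Hypothetical wrapper around the qpdf binary.
async function normalizeWithQpdf(inputPath: string, outputPath: string): Promise<void> {
  // execFile passes arguments as an array, so paths are never parsed by a shell.
  // Any non-zero exit status rejects, including qpdf's exit status for warnings.
  await execFileAsync("qpdf", qpdfNormalizeArgs(inputPath, outputPath));
}
```

Because qpdf exits non-zero when it reports warnings as well as errors, this sketch treats a warning as a failed normalization. Whether the real pipeline is that strict is not stated here.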

7. Application architecture

Two small services run on a single DigitalOcean droplet, managed by PM2 and fronted by Nginx. Every external dependency is open source and runs locally.

Application architecture
flowchart TD
    A[Browser] --> B[Nginx reverse proxy]
    B --> C[Nuxt web app on port 5102]
    B --> D[Express API on port 5103]
    C --> D
    D --> E[qpdf binary]
    D --> F[OpenDataLoader Java]
    D --> G[veraPDF Java]
    D --> H[SQLite database]

The browser talks to the Nginx reverse proxy, which routes to either the Nuxt web app (port 5102) or the Express API (port 5103). The web app makes some API calls back to Express. Express shells out to qpdf, OpenDataLoader Java, and veraPDF Java; it reads and writes SQLite locally. No external services.

8. The open-source toolchain

| Tool | Job | License | Pipeline |
|---|---|---|---|
| qpdf | Structure parsing + PDF normalization | Apache 2.0 | Audit + Remediation |
| pdfjs-dist | Text + metadata extraction | Apache 2.0 | Audit |
| OpenDataLoader PDF | Rule-based PDF auto-tagging | Apache 2.0 | Remediation |
| veraPDF | PDF/UA-1 (ISO 14289-1) validation | MPL 2.0 | Remediation |

Why OpenDataLoader matters: commercial PDF auto-tagging SDKs (Apryse, Adobe PDF Services, PDFix, CommonLook) start at roughly $1,500/year, and most use opaque, enterprise-quoted pricing. OpenDataLoader, released as Apache 2.0 in 2024, is the first credible open-source alternative, and it ranks #1 overall (0.907) in 2026 PDF-extraction benchmarks. For this tool, it replaces what previously cost thousands a year with an apt install openjdk-17-jre-headless.

Related documents