
How This Tool Works

A diagram-led tour of the audit and auto-remediation pipelines. For the full prose treatment with code references, see the Technical Details expandable on the audit page. For the legal/compliance-facing version, see the data retention policy.

1. What this tool does

The tool answers two questions about a document (PDF, Word .docx, PowerPoint .pptx, or Excel .xlsx):

Audit (PDF, Word, PowerPoint, and Excel): "How accessible is this document, and what specifically is wrong?" The result is a weighted 0–100 score (A–F grade) across the WCAG-aligned categories that apply to the format, a separate pass/fail WCAG 2.2 conformance verdict, and category-level findings.
Auto-remediate (PDF only; optional, opt-in): "Can we add accessibility structure to this PDF without making it worse?" This runs an automated tagging pipeline, validates the output, and serves the improved file only if every score profile holds or improves.

Both happen on a single DigitalOcean server controlled by ICJIA. Nothing leaves the server. No AI service is contacted at any point.

2. The audit pipeline

The audit holds the uploaded file in memory and never persists it, and the server detects the format from the file's content (not its name). PDFs are read by two tools: pdfjs reads the in-memory buffer directly, while qpdf (a command-line tool that needs a file path) gets a short-lived temp copy under a random name, deleted in the same request — even when analysis fails. The two run in parallel and their combined output feeds the scorer.
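The two PDF intake ideas above (detecting the format from content, and a temp copy that is removed even on failure) can be sketched in a few lines. This is an illustrative Python sketch, not the tool's own code: the function names `detect_format` and `with_temp_copy` are hypothetical, and the marker-part lookup is one plausible way to tell the Office formats apart.

```python
import io
import os
import tempfile
import zipfile

# Parts whose presence distinguishes the three Office Open XML formats.
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def detect_format(data: bytes) -> str:
    """Classify an upload by its bytes, never by its filename."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):  # ZIP local-file-header signature
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return "unknown"
        for part, fmt in OOXML_MARKERS.items():
            if part in names:
                return fmt
    return "unknown"


def with_temp_copy(data: bytes, analyze):
    """Give a path-based tool a short-lived copy, removed even on failure."""
    fd, path = tempfile.mkstemp(suffix=".pdf")  # random, unguessable name
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        return analyze(path)  # e.g. a function that shells out to qpdf
    finally:
        os.unlink(path)  # runs whether analyze() returned or raised
```

The `finally` clause is what guarantees the "even when analysis fails" behaviour: the copy is unlinked before the exception propagates to the caller.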

Word (.docx), PowerPoint (.pptx), and Excel (.xlsx) files take a simpler, fully in-process route: each is just a ZIP of XML (Office Open XML), so the server unzips it in memory and reads the accessibility-relevant structure directly with two small JavaScript libraries (JSZip + fast-xml-parser) — no external binary, no subprocess, and no temp file at all. The extracted structure feeds the same scorer. Nothing is uploaded to a directory, cached, or retained in either path. The flowchart below shows both paths.
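To make the "ZIP of XML" point concrete, here is a minimal sketch of reading heading paragraphs from a .docx entirely in memory. It uses Python's standard library rather than JSZip and fast-xml-parser, so it illustrates the idea only. The helper names are hypothetical, and matching style IDs that start with "Heading" is an assumption that holds for default English Word styles but not for every document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def heading_outline(docx_bytes: bytes) -> list[tuple[str, str]]:
    """Return (style id, text) for each paragraph styled as a heading."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # A production parser would need hardening for untrusted XML.
        root = ET.fromstring(package.read("word/document.xml"))
    outline = []
    for para in root.iter(f"{{{W}}}p"):
        style = para.find("w:pPr/w:pStyle", NS)
        if style is None:
            continue
        style_id = style.get(f"{{{W}}}val", "")
        if style_id.lower().startswith("heading"):
            text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
            outline.append((style_id, text))
    return outline


def make_sample_docx() -> bytes:
    """Build a tiny stand-in package holding only word/document.xml."""
    body = (
        f'<w:document xmlns:w="{W}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        "<w:r><w:t>Annual Report</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Body text.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as package:
        package.writestr("word/document.xml", body)
    return buf.getvalue()
```

Calling `heading_outline(make_sample_docx())` returns the single heading paragraph. No file path, subprocess, or temp file is involved at any step.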

Audit pipeline — PDF, Word, PowerPoint, and Excel

Why two tools for PDF? Each is excellent at a different job. qpdf parses the PDF's internal object graph and structure tree — the parts a screen reader cares about. pdfjs (Mozilla's PDF rendering library) is excellent at extracting visible text and content order. Running both gives the scorer a richer signal than either alone. The Office formats need only one parser, because their structure (headings or slide titles, alt text, table headers, sheet names) is already explicit in the OOXML — there is no separate visual layer to reconcile against.

The score, and the conformance verdict

The scorer produces two separate things, and the distinction matters:

The 0–100 score (A–F grade) is a weighted, partial-credit readiness metric for prioritisation: it shows how close a document is to being accessible and what to fix first. Weights reflect WCAG priority: text extractability carries the most (a scanned PDF with no extractable text is unusable, so nothing else matters), and reading order is weighted as a Level-A essential because out-of-order content makes a document unusable. Bookmarks, which map to the Level-AA "multiple ways" criterion and can be partly substituted by a clear heading structure, carry less.
The WCAG 2.2 conformance verdict is a separate, binary pass/fail. WCAG conformance is all-or-nothing per success criterion — one image without alt text fails 1.1.1 (Level A) outright — so a weighted score with partial credit cannot be a conformance claim. A document can score 90+ ("A") and still fail WCAG. The verdict reports confirmed, machine-checkable failures; when it finds none it says exactly that — not "conformant", because color contrast, reading-order nuance, and the correctness of alt text and tags still require a human reviewer.
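The all-or-nothing nature of the verdict is easy to see in code. The sketch below is illustrative only: the `Finding` type and the exact verdict strings are hypothetical, not the tool's actual data model or wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    criterion: str   # WCAG success criterion, e.g. "1.1.1"
    confirmed: bool  # True only for machine-verified failures


def conformance_verdict(findings: list[Finding]) -> str:
    """Binary verdict: one confirmed failure fails the whole document."""
    failed = sorted({f.criterion for f in findings if f.confirmed})
    if failed:
        return "Fails WCAG 2.2: " + ", ".join(failed)
    # Deliberately not "conformant": contrast, reading-order nuance, and
    # alt-text correctness still need a human reviewer.
    return "No confirmed machine-checkable failures"
```

Note that the function never looks at the weighted score, which is why a document graded "A" can still fail here on a single missing alt text.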

Each category maps to the specific WCAG 2.2 success criteria it evaluates. The weights below are the PDF rubric; the Office formats (Word, PowerPoint, Excel) use format-specific category sets, noted after the table:

PDF scoring categories, their weights, and the WCAG success criteria they evaluate
Category	Weight	WCAG 2.2 success criteria
Text Extractability	20%	1.1.1, 1.3.1 (A)
Title & Language	15%	2.4.2, 3.1.1 (A)
Heading Structure	15%	1.3.1 (A), 2.4.6 (AA)
Alt Text on Images	15%	1.1.1 (A)
Table Markup	10%	1.3.1 (A)
Reading Order	10%	1.3.2 (A)
Bookmarks / Navigation	5%	2.4.5 (AA)
Link Quality	5%	2.4.4 (A)
Form Accessibility	5%	1.3.1, 3.3.2, 4.1.2 (A)

For PDFs, color contrast (WCAG 1.4.3) is shown as Not assessed — the tool does not yet measure rendered PDF contrast, and that is stated plainly rather than hidden as a passing result. Reading Order and Alt Text can also report Not assessed when the tool lacks the data to judge them honestly (no comparable tag/content order; images present but none tagged). A category reads Not applicable only when the document genuinely has no such content (no tables, no forms, no links). In both cases the category's weight is redistributed across the categories that were actually scored.
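A short sketch shows how the weights in the table combine and what redistribution means in practice. It assumes the skipped weight is spread proportionally across the scored categories (the text above does not say which scheme the tool uses), and the function name `overall_score` is hypothetical.

```python
# Weights copied from the PDF rubric table; they sum to 100.
PDF_WEIGHTS = {
    "Text Extractability": 20,
    "Title & Language": 15,
    "Heading Structure": 15,
    "Alt Text on Images": 15,
    "Table Markup": 10,
    "Reading Order": 10,
    "Bookmarks / Navigation": 5,
    "Link Quality": 5,
    "Form Accessibility": 5,
}


def overall_score(category_scores: dict) -> float:
    """Weighted 0-100 score; None marks Not assessed / Not applicable."""
    scored = {k: v for k, v in category_scores.items() if v is not None}
    total_weight = sum(PDF_WEIGHTS[k] for k in scored)
    if total_weight == 0:
        raise ValueError("no category could be scored")
    # Dividing by the scored weight only is what redistributes the
    # skipped categories' weight proportionally (an assumption).
    return sum(PDF_WEIGHTS[k] * v for k, v in scored.items()) / total_weight
```

For example, a PDF scoring 100 on Text Extractability, 100 on Title & Language, and 50 on Heading Structure, with every other category skipped, gets (20·100 + 15·100 + 15·50) / 50 = 85.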

Word (.docx) differs in three ways. Color contrast is scored: Word stores explicit and theme colors in the file, so 1.4.3 is machine-checkable (unlike PDF, which would need pixel rendering). A Word-specific List Structure category (1.3.1: real lists vs. manually typed bullets) applies in place of PDF-only Bookmarks. And Reading Order and Form Accessibility are Not applicable, because Word's linear document flow preserves reading order and interactive form controls are rare in Word.
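Once a foreground and background color are known as hex values, the 1.4.3 check reduces to the contrast-ratio formula defined by WCAG. The sketch below implements that published formula; resolving theme colors to hex values is out of scope here, and the function names are hypothetical.

```python
def relative_luminance(hex_color: str) -> float:
    """WCAG relative luminance of an sRGB color such as 'FF8800'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB transfer curve to get linear-light values.
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """Ratio from 1:1 (identical colors) to 21:1 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """SC 1.4.3: at least 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

Gray 767676 on white just passes for normal text, while the slightly lighter 777777 does not, which shows how narrow the threshold is.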

PowerPoint (.pptx) swaps in slide-centric categories. A Slide Titles category (2.4.2 — every slide needs a unique title placeholder, Microsoft's highest-severity PowerPoint rule) applies in place of heading structure, and Reading Order is actively checked (1.3.2 — the title should be the first shape a screen reader encounters on each slide). Color contrast and list structure are scored as for Word; bookmarks and form accessibility don't apply to presentations and are omitted.
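The Slide Titles rule (every slide has a title, and no two titles repeat) can be sketched against a plain list of already-extracted titles. Treating titles as equal after trimming and lowercasing is an assumption made for this illustration, and `slide_title_issues` is a hypothetical name.

```python
from collections import Counter
from typing import Optional


def slide_title_issues(titles: list[Optional[str]]) -> list[str]:
    """titles[i] is the title text of slide i+1, or None if it has none."""
    normalized = [t.strip().lower() if t and t.strip() else None for t in titles]
    counts = Counter(t for t in normalized if t is not None)
    issues = []
    for number, title in enumerate(normalized, start=1):
        if title is None:
            issues.append(f"Slide {number}: missing title")
        elif counts[title] > 1:
            issues.append(f"Slide {number}: duplicate title")
    return issues
```

A deck with titles "Intro", none, "intro " and "Results" yields three issues: slides 1 and 3 share a title and slide 2 has none.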

Excel (.xlsx) is table-first. A Sheet Names category (no default "Sheet1" tabs on visible sheets) applies in place of heading structure, and Table Markup carries the most weight — data belongs in real table objects with header rows, and merged cells are flagged as advisories. Excel stores no document-language property, so Title & Language scores on the title alone and the language half is reported as not assessed. Reading order, lists, bookmarks, and forms don't apply and are omitted.
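The Sheet Names check is essentially a pattern match on visible sheets. This sketch assumes English default names ("Sheet1", "Sheet2", and so on); localized defaults would need extra patterns, and the function name is hypothetical.

```python
import re

# English default tab names only (an assumption for this sketch).
DEFAULT_SHEET_NAME = re.compile(r"^Sheet\d+$", re.IGNORECASE)


def default_named_sheets(sheets: list[tuple[str, bool]]) -> list[str]:
    """sheets holds (name, is_visible) pairs; hidden sheets are not flagged."""
    return [
        name
        for name, visible in sheets
        if visible and DEFAULT_SHEET_NAME.match(name.strip())
    ]
```

Given a visible "Sheet1", a visible "Budget 2024", and a hidden "Sheet2", only "Sheet1" is reported.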

WCAG 2.2 alignment

This tool reports against WCAG 2.2 Level AA. WCAG 2.2 builds on the WCAG 2.1 AA baseline that IITAA 2.1 (§E205.4) and ADA Title II require: it adds nine success criteria (six at Level A/AA) and removes one (4.1.1 Parsing, now obsolete), so apart from that removal it is a superset of 2.1. The automated checks are unchanged, because every machine-checkable criterion carries forward from 2.1. The new 2.2 criteria are interactive/manual; we never report them as automated failures. For documents with interactive form fields, the form-relevant new criteria (Target Size 2.5.8, Redundant Entry 3.3.7, Accessible Authentication 3.3.8) are listed in the verdict as "not assessed (manual review)".
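For reference, the nine criteria added in WCAG 2.2 can be written out as data, which also makes the "six at Level A/AA" count checkable. The criterion numbers, names, and levels come from the WCAG 2.2 recommendation; the `manual_review_notes` helper and its output wording are hypothetical.

```python
# Success criteria added in WCAG 2.2: number -> (name, level).
NEW_IN_WCAG_22 = {
    "2.4.11": ("Focus Not Obscured (Minimum)", "AA"),
    "2.4.12": ("Focus Not Obscured (Enhanced)", "AAA"),
    "2.4.13": ("Focus Appearance", "AAA"),
    "2.5.7": ("Dragging Movements", "AA"),
    "2.5.8": ("Target Size (Minimum)", "AA"),
    "3.2.6": ("Consistent Help", "A"),
    "3.3.7": ("Redundant Entry", "A"),
    "3.3.8": ("Accessible Authentication (Minimum)", "AA"),
    "3.3.9": ("Accessible Authentication (Enhanced)", "AAA"),
}

# The subset this page describes as relevant to interactive form fields.
FORM_RELEVANT = ("2.5.8", "3.3.7", "3.3.8")


def manual_review_notes(has_form_fields: bool) -> list[str]:
    """List the new criteria to surface as manual-review items."""
    if not has_form_fields:
        return []
    return [
        f"{sc} {NEW_IN_WCAG_22[sc][0]}: not assessed (manual review)"
        for sc in FORM_RELEVANT
    ]
```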

For a plain-language manager summary, see how WCAG 2.2 differs from 2.1. IITAA 2.1 does not yet reference WCAG 2.2, so 2.2 conformance is optional/forward-looking; WCAG 2.1 AA remains the legal minimum.

3. Why two tools (for PDF)?

This applies to PDF only. Each tool sees the PDF differently, and running them in parallel lets the scorer reconcile a structural view (qpdf) with a content view (pdfjs) — useful for catching cases where a PDF claims structure it doesn't actually have, or vice versa. Word, PowerPoint, and Excel need no such reconciliation: their structure is declared explicitly in the OOXML, so a single parser reads it directly.

4. What is a PDF, really?

A PDF is an export format, not a source format. Adobe created PDF in 1993 to solve "make a document look identical on every printer." You don't write in a PDF — you write in Word, InDesign, Pages, or Google Docs, and you export to PDF when you want to share the finished result.

That export step is where accessibility is won or lost. PDF natively stores where every glyph appears on the page, not what those glyphs mean. The semantic layer that makes a PDF accessible (the structure tree, formalised as Tagged PDF in PDF 1.4 in 2001) is optional. It only gets created if the export tool explicitly emits it (Word's "Document structure tags for accessibility," InDesign's "Create Tagged PDF," etc.).

Without tags, a PDF reads like raw glyph positions to a screen reader — incoherent. With tags, it reads like a navigable document with headings, paragraphs, lists, tables, and image descriptions.

And the Office formats (.docx, .pptx, .xlsx)?

Word, PowerPoint, and Excel files are the opposite of a PDF: they are source formats, and their structure is native, not bolted on. Under the hood each is a ZIP archive of XML (the Office Open XML standard) — headings, slide titles, sheet names, lists, tables, alt text, and language are stored as explicit, semantic markup, because that is how the Office apps represent the document you are editing. That is why the source file is the best place to fix accessibility: correct it there, and every PDF you export from it inherits the structure automatically.

It also makes the audit simpler and safer for the Office formats than for PDF — the tool reads structure that is already there rather than inferring it from glyph positions. Because an OOXML file is still untrusted input, the parser is hardened against malicious files (uncompressed-size "zip-bomb" caps, a concurrency limit, and a timeout; see the README security section).
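One of the hardening measures mentioned above, the uncompressed-size cap, can be sketched by summing the sizes a ZIP declares before extracting anything. The limits below are hypothetical placeholders, not the tool's configured values, and `check_archive_size` is an illustrative name.

```python
import io
import zipfile

MAX_TOTAL_UNCOMPRESSED = 200 * 1024 * 1024  # hypothetical cap: 200 MiB
MAX_ENTRIES = 10_000                        # hypothetical cap on parts


def check_archive_size(
    data: bytes,
    max_total: int = MAX_TOTAL_UNCOMPRESSED,
    max_entries: int = MAX_ENTRIES,
) -> int:
    """Reject archives whose declared expanded size exceeds the caps."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        entries = archive.infolist()
    if len(entries) > max_entries:
        raise ValueError("too many entries in archive")
    total = sum(entry.file_size for entry in entries)
    if total > max_total:
        raise ValueError("declared uncompressed size exceeds cap")
    # Declared sizes can be forged, so a robust reader should also cap
    # the bytes it actually reads while decompressing each part.
    return total
```

The check runs against the central directory only, so an oversized archive is rejected before any part is decompressed.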

5. Why remediation is fundamentally limited

Auto-remediation applies to PDF only — Word, PowerPoint, and Excel don't need it (fix the source in Office directly, then re-export). Auditing is read-only: you walk the document's structure and report what's there. Remediation is read-modify-write — and PDFs make that genuinely hard:

PDF was designed for print, not semantics. Tagged structure was bolted on in 2001 and is optional.
No canonical mapping from layout to semantics. Is a 14-pt bold line a heading or just emphasis? Software has to guess.
Content stream and structure tree are coupled. Adding alt-text means mutating both sides coherently — most libraries handle one but not both.
Tagged PDF spec is permissive. A PDF can satisfy the technical requirements for "being tagged" and still be inaccessible (e.g., every paragraph wrapped in <P> with no heading structure).
Mistakes compound. A wrong heading misleads a screen reader; a corrupted cross-reference makes the entire PDF unreadable.
Round-trip fidelity is the highest bar. Remediation must add semantic markup while preserving every visual nuance.

Result: PDF auto-remediation works well for the machine-checkable parts of accessibility (structure presence, metadata, language declaration). For the parts that require semantic judgment (alt-text quality, reading-order intent, decorative vs. informative classification), it defers to a human reviewer.

6. The remediation pipeline

PDF only. When the optional auto-remediation feature is enabled, clicking Remediate on a PDF result triggers a multi-stage pipeline. Every intermediate file is deleted before the next stage starts, and every deletion is verified by an fs.stat ENOENT check.
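The delete-then-verify discipline can be sketched as follows. The tool itself runs on Node and checks fs.stat for ENOENT; this Python version shows the same idea, where a `FileNotFoundError` from `os.stat` is the ENOENT case. The function names and the stage interface (each stage takes a path and returns the path it wrote) are assumptions for illustration.

```python
import os


def delete_and_verify(path: str) -> None:
    """Remove a file, then confirm a fresh stat reports it is gone."""
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass  # already gone; still verify below
    try:
        os.stat(path)
    except FileNotFoundError:  # ENOENT: deletion confirmed
        return
    raise RuntimeError(f"intermediate file still present: {path}")


def run_stages(working_copy: str, stages) -> str:
    """Run stages in order; each input is deleted before the next stage."""
    current = working_copy
    for stage in stages:
        produced = stage(current)   # stage returns the path it wrote
        delete_and_verify(current)  # previous file is gone before moving on
        current = produced
    return current
```

Only the final output survives `run_stages`; the caller is responsible for removing it once it has been served.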

Why qpdf normalize first? OpenDataLoader's PDF writer corrupts the output xref table on certain modern Adobe InDesign 18 and Microsoft Word 365 inputs. The qpdf --object-streams=disable step decompresses object streams before ODL touches the file, which works around the bug entirely. The same step also repairs recoverably damaged uploads (qpdf rewrites a clean cross-reference table, exiting with a warning that the pipeline accepts as of v1.26.1) — so the slightly broken files most in need of remediation are repaired at intake rather than rejected. The ODL workaround was discovered during the v1.18.0 feasibility spike — see docs/archive/spike-remediation-results.md.
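A minimal sketch of the normalize step shows how a warning exit can be accepted while real errors are still rejected. The `--object-streams=disable` flag is the one named above, and qpdf documents exit status 3 as "warnings, output written" and 2 as "errors". The function names and the 60-second timeout are hypothetical, not the pipeline's actual code.

```python
import subprocess

QPDF_OK = 0
QPDF_WARNINGS = 3  # qpdf wrote output but reported recoverable problems


def normalize_command(src: str, dst: str) -> list[str]:
    """Build the argv for the normalization pass."""
    return ["qpdf", "--object-streams=disable", src, dst]


def exit_code_accepted(code: int) -> bool:
    """Treat clean runs and warning-only runs as success."""
    return code in (QPDF_OK, QPDF_WARNINGS)


def normalize(src: str, dst: str, timeout_s: float = 60.0) -> None:
    """Run qpdf; raise unless it succeeded or only warned."""
    result = subprocess.run(
        normalize_command(src, dst),
        capture_output=True,
        timeout=timeout_s,  # hypothetical limit
    )
    if not exit_code_accepted(result.returncode):
        detail = result.stderr.decode(errors="replace")
        raise RuntimeError(f"qpdf failed ({result.returncode}): {detail}")
```

Passing the arguments as a list rather than a shell string keeps untrusted filenames from being interpreted by a shell.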

7. Application architecture

Two small services run on a single DigitalOcean droplet, managed by PM2 and fronted by Nginx. Every dependency is open source and runs locally. The PDF path shells out to qpdf (and, for remediation, the OpenDataLoader and veraPDF Java tools); the Word, PowerPoint, and Excel path needs none of those — it runs entirely in-process with the JSZip and fast-xml-parser JavaScript libraries.

8. The open-source toolchain

The open-source toolchain: each tool, its job, license, and pipeline stage
Tool	Job	License	Pipeline
qpdf	Structure parsing + PDF normalization	Apache 2.0	Audit + Remediation (PDF)
pdfjs-dist	Text + metadata extraction	Apache 2.0	Audit (PDF)
jszip	Unzip the OOXML package (.docx / .pptx / .xlsx)	MIT / GPLv3	Audit (Office formats)
fast-xml-parser	Parse OOXML structure & content	MIT	Audit (Office formats)
OpenDataLoader PDF	Rule-based PDF auto-tagging	Apache 2.0	Remediation (PDF)
veraPDF	PDF/UA-1 (ISO 14289-1) validation	MPL 2.0	Remediation (PDF)

Why OpenDataLoader matters: commercial PDF auto-tagging SDKs (Apryse, Adobe PDF Services, PDFix, CommonLook) start at ~$1,500/year, and most publish no prices at all, only enterprise quotes. OpenDataLoader, released as Apache 2.0 in 2024, is the first credible open-source alternative, and it ranks #1 overall (0.907) in 2026 PDF-extraction benchmarks. For this tool, it replaces what previously cost thousands a year with an apt install openjdk-17-jre-headless.
