Policy · v1.0

Data Retention Policy

Effective 2026-05-18 · Applies to tool version 1.18.0 and newer · This document is part of the open-source project source code and is version-controlled at apps/web/app/pages/data-retention.vue.

This policy describes how the ICJIA File Accessibility Audit tool ingests, processes, retains, and deletes PDF files and related metadata. It is intended for managers, records-retention officers, accessibility auditors, legal counsel, and other stakeholders who need a complete and accurate account of the tool's data handling. Technical details are included verbatim — vague language has been avoided in favor of precision.

No AI is used.

Your PDF is never sent to ChatGPT, GPT-4, GPT-4o, Claude, Gemini, Copilot, or any other artificial-intelligence service. No machine-learning model is loaded on this server. The tool uses rule-based, deterministic, open-source software exclusively. See § 4 below for the complete exclusion list.

Key facts at a glance

0 · PDFs retained after processing
≤30 min · Maximum remediation-output retention
0 · AI services contacted
0 · Third-party data transmissions
100% · Deletion verified via fs.stat
7 yr · Audit-trail retention (configurable)
100% · Open-source toolchain
3 · Open-source tools (qpdf · ODL · veraPDF)

1. Scope & applicable systems
2. Audit pipeline — how PDFs are handled when you click "Audit"
3. Remediation pipeline — how PDFs are handled when you click "Remediate"
4. AI usage statement (none)
5. The open-source toolchain (qpdf · OpenDataLoader · veraPDF · pdfjs)
6. Lifecycle audit trail (the auditor's evidence)
7. Retention periods by data category
8. What is and isn't stored
9. Security & technical safeguards
10. Security audit history (red/blue team reviews)
11. Right to inspect & verify
12. Standards & compliance alignment
13. Glossary of technical terms
14. Change log for this policy
15. Contact & questions

1. Scope & applicable systems

This policy applies to all PDF files processed by the ICJIA File Accessibility Audit tool — both the public production deployment at https://audit.icjia.app and any derivative deployment running the same source code. The infrastructure is hosted on DigitalOcean (a U.S.-based cloud provider), managed via Laravel Forge, and runs on a single virtual private server (VPS) located in a DigitalOcean data center. No content is replicated to external storage, content delivery networks, or backup services.

Two distinct processing pipelines exist within the tool:

The audit pipeline (always available) — analyzes a PDF for WCAG 2.1 AA / ADA Title II / Illinois IITAA accessibility conformance signals and returns a score and findings.
The remediation pipeline (optional, gated by the server-side REMEDIATION_ENABLED environment flag) — produces a tagged, more-accessible version of the uploaded PDF.

Both pipelines are described separately below because their data lifecycle differs. The audit pipeline operates entirely in memory; the remediation pipeline requires brief on-disk storage during processing, which is described in detail with corresponding deletion verification.

2. Audit pipeline (always available)

When a user uploads a PDF for auditing, the file is processed entirely in volatile server memory. No copy is written to disk at any point during the audit pipeline.

Client → HTTPS upload (multipart/form-data)
  │
  ▼
[multer.memoryStorage()]: buffer in API process memory
  │
  ▼
[validate file]
  - Magic-byte check: file must start with '%PDF-'
  - File size limit: 15 MB (configurable; rejected if exceeded)
  │
  ▼
[analyzePDF(buffer, filename)]: runs synchronously
  ├── qpdf subprocess (file passed via stdin or temp pipe)
  │     • parses structure tree, language, outlines, images, tables
  └── pdfjs (Node.js library)
        • extracts text, metadata, per-page content order
  │
  ▼
[scorer]: 9 WCAG-aligned categories, weighted overall score
  │
  ▼
HTTP response → client (typically < 10 seconds total)
  │
  ▼
Node.js garbage collector reclaims the buffer
(file no longer exists in any form, anywhere)

Audit pipeline — visual flow

flowchart TD
    A[Upload PDF] --> B[Validate file]
    B --> C[Hold in memory only]
    C --> D[qpdf + pdfjs analyze]
    D --> E[Score 9 WCAG categories]
    E --> F[Send result to browser]
    F --> G[Memory buffer discarded]

Flowchart of the audit pipeline. The uploaded PDF is held in memory, validated, analyzed by qpdf and pdfjs, scored across 9 WCAG categories, and the memory buffer is discarded after the response is sent.

Once the HTTP response has been sent, nothing references the in-memory buffer any longer, and the Node.js garbage collector reclaims it in a subsequent collection cycle. The PDF content does not persist on disk, in a cache, in a log file, or in any other location. The only records produced by an audit are entries in the audit_log table (described in § 8), which contain metadata only: filename, score, grade, email if logged in, timestamp, and SHA-256 hash of the file's bytes.

Encrypted PDFs are rejected. A password-protected PDF cannot be analyzed without the password; the tool returns a clear error before any analysis is attempted, and the file is discarded immediately.
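The upload validation step in the flow above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `validateUpload`, the result type, and the reading of "15 MB" as 15 × 1024 × 1024 bytes are assumptions made for the example.

```typescript
import { Buffer } from "node:buffer";

// Every PDF begins with these five bytes.
const PDF_MAGIC = Buffer.from("%PDF-", "latin1");

// Assumption: "15 MB" is interpreted as binary megabytes for this sketch.
const MAX_AUDIT_BYTES = 15 * 1024 * 1024;

type ValidationResult = { ok: true } | { ok: false; reason: string };

// Hypothetical helper: checks the size cap first, then the magic bytes.
function validateUpload(buffer: Buffer, maxBytes: number = MAX_AUDIT_BYTES): ValidationResult {
  if (buffer.length > maxBytes) {
    return { ok: false, reason: "file exceeds the size limit" };
  }
  const header = buffer.subarray(0, PDF_MAGIC.length);
  if (!header.equals(PDF_MAGIC)) {
    return { ok: false, reason: "file does not start with %PDF-" };
  }
  return { ok: true };
}
```

Because the buffer comes from in-memory storage, a rejected upload needs no cleanup step: dropping the reference is enough.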

3. Remediation pipeline (optional, gated)

The remediation pipeline is disabled by default. It can be enabled by setting REMEDIATION_ENABLED=true in the server's environment. When enabled, a new Auto-Remediate this PDF action appears on the audit results page. Clicking it triggers the following lifecycle:

Client → HTTPS multipart upload (re-upload required by design)
  │
  ▼
[validate] magic bytes, size cap, page count cap (500 pages)
  │
  ▼
[create remediation_jobs row]
  • status: 'pending'
  • email (if logged in)
  • content_hash: SHA-256 of input bytes
  • download_token: 32-byte random, sha256-hashed at rest
  │
  ▼
[write input → data/remediation/<jobId>/work/input.pdf] (mode 0600)
  │
  ▼
[spawn detached worker: tsx src/jobs/remediate.ts <jobId>]
  │
  ▼ (API responds 202 to client; worker runs independently)
  │
[Stage 1: preparing]
  • qpdf --object-streams=disable input.pdf → normalized.pdf
  • DELETE input.pdf + fs.stat verify ENOENT
  • Emit lifecycle event: 'normalize_complete', 'input_deleted', 'verified_absent'
  │
  ▼
[Stage 2: tagging]
  • OpenDataLoader convert(normalized.pdf) → tagged.pdf
  • DELETE normalized.pdf + fs.stat verify ENOENT
  • Emit events: 'tagging_complete', 'intermediate_deleted', 'verified_absent'
  │
  ▼
[Stage 3: validating]
  • qpdf --check tagged.pdf → must not report warnings
  • veraPDF --flavour ua1 tagged.pdf → conformance verdict (informational)
  • Emit: 'validation_passed' OR 'validation_failed' + 'verapdf_passed'/'verapdf_failed'/'verapdf_unavailable'
  │
  ▼
[Stage 4: comparing]
  • Re-audit tagged.pdf → output score
  • If Overall, Strict, OR Practical score regresses: REJECT
  │
  ▼ (success branch)
[Move tagged.pdf → data/remediation/<jobId>.pdf (final, mode 0600)]
[update job: status='complete', expires_at = NOW + 30 min]
[Emit: 'output_ready']
  │
  ▼
Client polls /api/remediate/<jobId>/status; sees 'complete'
  │
  ▼
Client downloads via single-use token:
  [stream output via createReadStream + pipe(res)]
  → on response 'close': DELETE output.pdf + fs.stat verify ENOENT
  → Emit: 'downloaded', 'output_deleted', 'verified_absent'
  → job status → 'expired' (token invalidated; concurrent requests get 410)
  │
  ▼ (or, if no download in 30 minutes)
[Cleanup sweep deletes output.pdf + fs.stat verify ENOENT]
[Emit: 'expired', 'output_deleted', 'verified_absent']

ALL OUTCOMES → final state: zero PDF artifacts on disk.

Remediation pipeline — visual flow

flowchart TD
    A[Upload PDF] --> B[Write to scratch]
    B --> C[qpdf normalize]
    C --> D[Delete original + verify]
    D --> E[OpenDataLoader tag]
    E --> F[Delete normalized + verify]
    F --> G[qpdf check + veraPDF]
    G --> H[Re-audit, guard regressions]
    H --> I[Output ready, 30 min TTL]
    I --> J[User downloads]
    J --> K[Delete output + verify]

Flowchart of the remediation pipeline. Eleven steps from upload through final delete + verify. Every intermediate file is deleted before the next stage starts, and every delete is fs.stat-verified.

Key invariants of the remediation pipeline:

Between pipeline stages, at most one copy of the PDF exists on disk; two copies coexist only while a stage is writing its output from its input. The input is deleted as soon as the normalized intermediate has been written, before tagging starts; the normalized intermediate is deleted as soon as the tagged output has been written, before validation starts; the tagged output is deleted on first download or after 30 minutes.
The entire scratch directory (data/remediation/<jobId>/work/) is removed in a finally block regardless of pipeline outcome — including crashes, errors, and rejected outputs. A worker crash mid-pipeline triggers cleanup on API restart (see § 9).
The remediated output is served via a one-time download token. The token is generated as 32 cryptographically random bytes and stored on the job row only as its SHA-256 hash (the raw token is never stored). A successful download invalidates the token immediately, before the file contents are streamed, so any concurrent or repeat request receives 410 Gone.
The remediation pipeline does not cache the PDF between audit and remediation. Clicking "Remediate" triggers a fresh multipart upload — the just-audited buffer is not preserved. This is a deliberate UX-vs-privacy trade-off that costs the user one extra upload click in exchange for a stricter retention posture.
The pipeline runs entirely on the ICJIA-controlled server. No PDF content is transmitted to external services, cloud APIs, or AI models — see § 4.
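The one-time download token invariant above can be illustrated with a short sketch. The names `issueDownloadToken`, `redeemDownloadToken`, and the in-memory `JobRow` shape are hypothetical stand-ins for the real job row, and the comparison here uses Node's built-in `timingSafeEqual` for brevity rather than the byte-wise XOR described in § 9.

```typescript
import { Buffer } from "node:buffer";
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical stand-in for the relevant column of a remediation_jobs row.
type JobRow = { download_token_hash: string | null };

function sha256(value: string): Buffer {
  return createHash("sha256").update(value).digest();
}

// Generates 32 random bytes, stores only the SHA-256 hash, returns the raw token once.
function issueDownloadToken(job: JobRow): string {
  const rawToken = randomBytes(32).toString("base64url");
  job.download_token_hash = sha256(rawToken).toString("hex");
  return rawToken;
}

// Returns true at most once per issued token; the hash is cleared before
// the caller starts streaming, so a concurrent or repeat request fails.
function redeemDownloadToken(job: JobRow, presented: string): boolean {
  if (job.download_token_hash === null) return false;
  const stored = Buffer.from(job.download_token_hash, "hex");
  if (!timingSafeEqual(sha256(presented), stored)) return false;
  job.download_token_hash = null;
  return true;
}
```

A real handler would perform the check and the invalidation inside a single database transaction; the sketch ignores concurrency and only shows the ordering (invalidate first, stream second).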

4. AI usage statement

No artificial intelligence is used in this tool.

Specifically: no PDF content, no extracted text, no metadata, no filenames, no derivative artifacts, no diagnostic data, and no telemetry of any kind are transmitted to any artificial-intelligence service, large language model, vision model, or hosted machine learning API, including but not limited to:

OpenAI (ChatGPT, GPT-3.5, GPT-4, GPT-4o, embedding APIs)
Anthropic (Claude family of models)
Google (Gemini, Bard, PaLM, Vertex AI)
Microsoft (Copilot, Azure OpenAI)
Meta (Llama hosted endpoints)
Amazon (Bedrock, SageMaker hosted endpoints)
Any open-source model hosted by a third-party inference provider (Replicate, Modal, Hugging Face Inference API, etc.)
Self-hosted machine-learning models on this server (none are loaded)

The auto-remediation pipeline uses three open-source software tools (qpdf, OpenDataLoader PDF, veraPDF — see § 5), all of which operate on rule-based, deterministic algorithms. None of these tools load or run a machine-learning model at runtime. Their source code is publicly available and auditable.

A future feature on the project roadmap (Phase 1, documented at docs/pdf-remediation-alt-text-walkthrough-spec.md) adds an interactive walkthrough that lets users manually author alt-text for figures in their remediated PDFs. This feature is specifically designed to be AI-free — the user types descriptions themselves, and the descriptions are written back into the PDF by the deterministic pdf-lib library. No AI suggestion, no autocomplete from a model, no inference of any kind.

Any future addition of AI features will be announced in this policy and in the public changelog before the feature is enabled in production, with a corresponding update to the policy version above.

No AI services contacted

flowchart TD
    A[Your PDF] --> B[ICJIA server]
    B --> C[qpdf local]
    B --> D[OpenDataLoader local]
    B --> E[veraPDF local]
    B --> F[SQLite local]
    B -.OTP code only.-> G[Mailgun email]
    B -.NEVER.-> X[ChatGPT, Claude, Gemini, Copilot]

Flowchart showing the ICJIA server talks only to local tools (qpdf, OpenDataLoader, veraPDF, SQLite) and Mailgun (for OTP codes only). It NEVER sends data to ChatGPT, Claude, Gemini, or Copilot.

5. The open-source toolchain

Every tool involved in processing PDFs is open source, license-clear, and runs locally on the ICJIA-controlled server. No commercial PDF SDK is licensed, and no per-document fees are paid. The tools are:

Tool	Used for	License	Source
qpdf	PDF structure parsing (audit) and PDF normalization + validity checking (remediation)	Apache 2.0	qpdf.sourceforge.io
pdfjs-dist	PDF text and metadata extraction (audit pipeline only)	Apache 2.0	github.com/mozilla/pdf.js
OpenDataLoader PDF	Rule-based PDF auto-tagging (remediation pipeline only)	Apache 2.0	opendataloader-project (ICJIA fork: ICJIA/opendataloader-pdf)
veraPDF	PDF/UA-1 (ISO 14289-1) conformance validation (remediation pipeline only; optional)	MPL 2.0	verapdf.org

Each tool is invoked as a separate operating-system process (qpdf, OpenDataLoader, veraPDF) or as a Node.js library (pdfjs-dist), with input passed by file path or in-memory buffer. None of these tools opens an outbound network connection during processing. The only outbound network traffic from the API process during a remediation job goes to the email server (Mailgun, used solely for OTP authentication of users and never for content transmission). The database is a local SQLite file and involves no network traffic.

6. Lifecycle audit trail

Every remediation job produces an append-only series of timestamped events in the server's SQLite database file (apps/api/data/audit.db, table remediation_events). The same database also holds the lighter-weight audit log (audit_log table) for plain audit requests. Schemas:

CREATE TABLE remediation_events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  job_id TEXT NOT NULL,
  event TEXT NOT NULL,
  occurred_at INTEGER NOT NULL, -- milliseconds since Unix epoch
  details TEXT, -- JSON, content-free metadata only
  FOREIGN KEY (job_id) REFERENCES remediation_jobs(id)
);
CREATE INDEX idx_remediation_events_job ON remediation_events(job_id, occurred_at);
CREATE INDEX idx_remediation_events_event ON remediation_events(event);

CREATE TABLE remediation_jobs (
  id TEXT PRIMARY KEY, -- UUIDv4
  email TEXT, -- null when anonymous
  input_filename TEXT NOT NULL,
  content_hash TEXT, -- SHA-256 of input bytes
  page_count INTEGER,
  status TEXT NOT NULL, -- pending/running/complete/failed/expired
  step TEXT,
  progress_pct INTEGER DEFAULT 0,
  input_score REAL, -- pre-flight audit score
  output_score REAL, -- post-remediation audit score
  output_valid INTEGER, -- 1 = qpdf --check passed
  output_path TEXT, -- absolute path on disk, only when complete
  download_token_hash TEXT, -- SHA-256 of raw token
  failure_reason TEXT,
  verapdf_available INTEGER,
  verapdf_passed INTEGER,
  verapdf_summary_json TEXT,
  input_audit_json TEXT, -- full pre-flight ScoringResult
  output_audit_json TEXT, -- full post-remediation ScoringResult
  created_at INTEGER NOT NULL,
  completed_at INTEGER,
  expires_at INTEGER NOT NULL
);

The closed set of event types emitted per job is:

received · processing_started · normalize_complete · input_deleted · tagging_complete · intermediate_deleted · validation_passed · validation_failed · verapdf_passed · verapdf_failed · verapdf_unavailable · output_ready · downloaded · output_deleted · verified_absent · verify_failed · expired · error

The verified_absent event is the critical compliance signal. It is emitted only after the worker (or the cleanup sweep, or the download handler) calls fs.unlink() followed by fs.stat() on the deleted path and receives an ENOENT error (the POSIX code for "no such file or directory"), confirming that the path no longer resolves to a file on the filesystem. If fs.stat() returns any other result (file still present, permission error, etc.), a verify_failed event is recorded instead, indicating a compliance anomaly that must be investigated.

File paths in event payloads are stored as SHA-256 hashes, not raw strings. This keeps the payload uniform in length and resistant to log-scraping, and it ensures the audit trail cannot accidentally reveal directory structure or user identifiers via path strings.

A sample event payload (the details JSON for a verified_absent event):

{ "path_hash": "a3f5e7d2c4b6a8e9f1c3d5b7a9e1c3d5b7a9e1c3d5b7a9e1c3d5b7a9e1c3d5b7" }
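The delete-then-verify step and the hashed-path payload described above might look roughly like this. It is a sketch under stated assumptions: the helper name `deleteAndVerify` and the `VerifyEvent` type are hypothetical, and it uses the synchronous `unlinkSync`/`statSync` variants to keep the example short, where the text above describes `fs.unlink()` and `fs.stat()`.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, rmSync, statSync, unlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

type VerifyEvent = {
  event: "verified_absent" | "verify_failed";
  details: { path_hash: string };
};

// Hypothetical helper: deletes a file, then proves the deletion with a stat call.
function deleteAndVerify(filePath: string): VerifyEvent {
  // Only the SHA-256 of the path goes into the payload, never the path itself.
  const details = { path_hash: createHash("sha256").update(filePath).digest("hex") };
  try {
    unlinkSync(filePath);
  } catch {
    // Ignored on purpose: the stat call below decides the outcome either way.
  }
  try {
    statSync(filePath);
    return { event: "verify_failed", details }; // the file is still present
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code;
    // Only ENOENT counts as proof of absence; EACCES and friends do not.
    return { event: code === "ENOENT" ? "verified_absent" : "verify_failed", details };
  }
}

// Usage: create a throwaway file, delete it, and inspect the resulting event.
const scratchDir = mkdtempSync(join(tmpdir(), "retention-demo-"));
const scratchFile = join(scratchDir, "input.pdf");
writeFileSync(scratchFile, "%PDF-1.7\n", { mode: 0o600 });
const outcome = deleteAndVerify(scratchFile);
rmSync(scratchDir, { recursive: true, force: true });
console.log(outcome.event, outcome.details.path_hash.length);
```

Treating every non-ENOENT result as a failure is the conservative choice: a permission error means the check could not prove absence, so it is recorded as an anomaly rather than assumed to be fine.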

The audit trail is intentionally append-only: no application code path overwrites or deletes individual event rows. Rows are purged only by the periodic cleanup sweep after they exceed the retention period (see § 7), which executes a single DELETE statement bounded by an age cutoff. Anomalies — for example, a job that completed without a corresponding verified_absent event — are visible to any auditor running a sentinel query.
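A sentinel check of the kind mentioned above could be expressed over exported event rows as follows. This is an illustrative sketch, not the project's query: the function name `findAnomalousJobs` is hypothetical, and it assumes (based on the pipeline diagram in § 3) that every `*_deleted` event should be matched by one `verified_absent` event for the same job.

```typescript
// Mirrors the columns of remediation_events that the check needs.
type EventRow = { job_id: string; event: string; occurred_at: number };

const DELETION_EVENTS = new Set(["input_deleted", "intermediate_deleted", "output_deleted"]);

// Hypothetical sentinel: returns the IDs of jobs whose deletions are not fully verified.
function findAnomalousJobs(rows: EventRow[]): string[] {
  const deletions = new Map<string, number>();
  const verifications = new Map<string, number>();
  const flagged = new Set<string>();

  for (const row of rows) {
    if (DELETION_EVENTS.has(row.event)) {
      deletions.set(row.job_id, (deletions.get(row.job_id) ?? 0) + 1);
    } else if (row.event === "verified_absent") {
      verifications.set(row.job_id, (verifications.get(row.job_id) ?? 0) + 1);
    } else if (row.event === "verify_failed") {
      flagged.add(row.job_id); // an explicit failure is always an anomaly
    }
  }

  // A job is anomalous when it logged more deletions than verifications.
  for (const [jobId, deleted] of deletions) {
    if (deleted > (verifications.get(jobId) ?? 0)) flagged.add(jobId);
  }
  return [...flagged].sort();
}
```

The same logic translates directly into a grouped SQL query against the remediation_events table; the in-memory form is shown only because it is easy to run against an exported copy of the log.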

7. Retention periods by data category

Data category	Where stored	Maximum retention	Configurable
Uploaded PDF (audit)	Server memory only	Seconds; discarded after HTTP response	No
Uploaded PDF (remediation input)	`data/remediation/<jobId>/work/input.pdf`	Seconds; deleted after qpdf normalize stage	No
Normalized intermediate PDF	`data/remediation/<jobId>/work/normalized.pdf`	Seconds; deleted after OpenDataLoader tag stage	No
Remediated tagged PDF (output)	`data/remediation/<jobId>.pdf`	First download OR 30 minutes (whichever first)	Yes — `REMEDIATION.OUTPUT_TTL_MS`
Remediation job row (metadata only)	SQLite, `remediation_jobs` table	30 days after completion	Yes — `REMEDIATION.JOB_ROW_RETENTION_DAYS`
Lifecycle events (audit trail)	SQLite, `remediation_events` table	7 years (default)	Yes — `REMEDIATION.EVENT_LOG_RETENTION_DAYS`
Audit log (plain audits, no PDFs)	SQLite, `audit_log` table	Indefinite (purgeable on request)	By admin request
Shared reports (audit results only)	SQLite, `shared_reports` table	365 days from share creation	Yes — `SHARED_REPORTS.EXPIRY_DAYS`
OTP authentication codes	SQLite, `otp_codes` table	10 minutes (single-use)	Yes — `AUTH.OTP_EXPIRY_MINUTES`

Retention periods marked "configurable" can be adjusted in the source configuration file (audit.config.ts) before deployment. The defaults shown represent the standing posture for the production deployment; any deployment running modified values publishes those values in its own deployment notes.

A periodic cleanup sweep runs every 5 minutes within the API process and on every API startup. It performs five tasks idempotently: expire outputs past expires_at; mark stuck jobs as failed; remove orphan directories; purge old remediation_jobs rows; purge old remediation_events rows. Source: apps/api/src/services/remediationCleanup.ts.
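The decision logic for three of the sweep's tasks can be sketched as a pure planning function. The names `planSweep` and `SweepPlan` are hypothetical, and the sketch makes two assumptions the policy does not spell out: a job counts as stuck when it has been `running` for more than 10 minutes measured from `created_at`, and row retention is measured from `completed_at`.

```typescript
// Subset of remediation_jobs columns the sweep needs.
type JobRow = {
  id: string;
  status: "pending" | "running" | "complete" | "failed" | "expired";
  created_at: number; // milliseconds since Unix epoch
  completed_at: number | null;
  expires_at: number;
};

const STUCK_AFTER_MS = 10 * 60 * 1000; // stuck in "running" for over 10 minutes
const JOB_ROW_RETENTION_MS = 30 * 24 * 60 * 60 * 1000; // 30 days after completion

type SweepPlan = { expire: string[]; failStuck: string[]; purgeRows: string[] };

// Hypothetical planner: decides what the sweep should do, without doing it.
function planSweep(jobs: JobRow[], now: number): SweepPlan {
  const plan: SweepPlan = { expire: [], failStuck: [], purgeRows: [] };
  for (const job of jobs) {
    if (job.status === "complete" && job.expires_at <= now) {
      plan.expire.push(job.id); // output is past its TTL: delete and verify
    } else if (job.status === "running" && now - job.created_at > STUCK_AFTER_MS) {
      plan.failStuck.push(job.id); // worker presumably crashed
    } else if (job.completed_at !== null && now - job.completed_at > JOB_ROW_RETENTION_MS) {
      plan.purgeRows.push(job.id); // metadata row is past retention
    }
  }
  return plan;
}
```

Separating the plan from its execution makes the idempotence easy to see: running the planner twice against the same rows and the same clock yields the same plan, and once the actions are applied the rows no longer match any branch.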

8. What is and isn't stored

Stored (metadata only)

Filename of the uploaded PDF (sanitized before storage)
SHA-256 hash of file bytes (a 64-character hex digest)
Page count (integer)
Pre-flight audit score and grade (numbers + letter)
Post-remediation audit score and grade
Per-category findings and explanations (text generated by the scorer, never copied from the PDF)
User's email address (only if logged in; tied to the user's own account)
Server-side timestamps for every lifecycle event
Job status, step name, progress percentage
Failure reasons (string descriptions, no content)
veraPDF verdict summary (passed/failed + rule IDs of failing rules, no content)
SHA-256 hash of download token (token itself never stored)
SHA-256 hash of deleted file paths (paths never stored)

Never stored

PDF file content (audit pipeline)
PDF file content after a remediation completes (output is deleted on download or after 30-minute TTL)
Extracted text from inside PDFs
Images extracted from PDFs (none are stored)
Form-field values from PDFs
Any data transmitted to AI services (there are none — see § 4)
Any data shared with third-party analytics, ad networks, or tracking services
Browser fingerprints or cross-site tracking identifiers
IP addresses of users (beyond what's in standard web logs)
Raw file paths in lifecycle events (paths are hashed before storage)
Raw download tokens (tokens are hashed; the original 32-byte random token is held in the URL only)
Backups of the SQLite database to external storage (the database is on the local filesystem only)

9. Security & technical safeguards

HTTPS / TLS 1.2+ on all transport between client and server. The production deployment uses certificates issued by Let's Encrypt and renewed automatically.
HTTP-only cookies for authentication, with SameSite=Strict set to prevent cross-site request forgery.
Restrictive filesystem permissions on remediation data: 0700 on directories, 0600 on output files. Only the process owner can read these files.
Unguessable identifiers: job IDs are UUIDv4 (122 bits of cryptographic entropy); download tokens are 32-byte random base64url-encoded strings.
Constant-time-ish token comparison: download tokens are compared via byte-wise XOR over fixed-length SHA-256 hashes, mitigating timing side channels.
Magic-byte validation on uploads: a file is rejected immediately if its first five bytes are not %PDF-.
File size cap: 15 MB for the audit pipeline, 50 MB for the remediation pipeline (configurable).
Page count cap: 500 pages for remediation (configurable). Pathological PDFs with thousands of pages are rejected before any processing.
JVM memory cap on the OpenDataLoader child process: 768 MB heap via JAVA_TOOL_OPTIONS=-Xmx768m to bound resource consumption.
Wall-clock timeout on the remediation worker: 5 minutes (configurable). The JVM child is killed on overrun.
Per-user concurrency limit: 1 remediation job at a time per user (configurable).
Rate limiting on upload endpoints to prevent abuse.
Encrypted PDFs are rejected with a clear error before any analysis is attempted.
Cleanup on startup: when the API restarts, a sweep reconciles disk vs database — jobs stuck in "running" for over 10 minutes are marked failed; orphan files with no matching database row are removed.
Regression guard on remediation: the output PDF is rejected if its score regresses on Overall, Strict, or Practical profiles relative to the input. The user never sees an output that would make any visible metric worse.
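The byte-wise XOR comparison mentioned in the token-comparison safeguard above can be sketched like this. The function names are hypothetical; the technique is the standard one of accumulating differences across every byte so that the loop's duration does not depend on where the first mismatch occurs.

```typescript
import { Buffer } from "node:buffer";
import { createHash } from "node:crypto";

// Compares two equal-length buffers without exiting early on a mismatch.
function hashesEqual(a: Buffer, b: Buffer): boolean {
  if (a.length !== b.length) return false; // SHA-256 digests are always 32 bytes
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a[i] ^ b[i]; // any differing bit sets a bit in diff
  }
  return diff === 0;
}

// Hypothetical wrapper: hashes the presented token, then compares digests.
function tokenMatches(presentedToken: string, storedHashHex: string): boolean {
  const presentedHash = createHash("sha256").update(presentedToken).digest();
  return hashesEqual(presentedHash, Buffer.from(storedHashHex, "hex"));
}
```

Comparing hashes rather than raw tokens has a second benefit: both inputs always have the same length, so the length check never leaks anything about the stored value.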

10. Security audit history (red/blue team reviews)

What is a red/blue team audit, in plain language?

Imagine the tool is a bank vault. The red team plays the role of someone trying to break in — looking for unlocked doors, weak walls, or ways to trick the guards. They aren't actually attackers; they're security-minded reviewers who deliberately think like attackers. The blue team plays the defenders — documenting every lock, alarm, and procedure that's supposed to keep the vault safe.

A red/blue team audit is when both teams sit down together — often the same person playing both roles — and systematically work through everything that could go wrong: "What if someone uploads a poisoned file?" "What if two people try to download the same thing at once?" "What if the server runs out of memory mid-job?" For each scenario, they identify whether existing protections are adequate, what could fail, and how to fix it.

The output is a list of findings, each rated by severity:

P0 — critical: the system is broken right now and users are exposed. Must be fixed immediately, before any release.
P1 — serious: a real vulnerability that could be exploited. Must be fixed before the upcoming release.
P2 — moderate: a real concern, but its impact is bounded by other protections. Documented; sometimes accepted as a known limitation if mitigation is in place.
P3 — minor: a small concern or theoretical risk. Tracked; addressed when convenient.

Why this matters for compliance: ADA Title II, Illinois IITAA, and most state-agency procurement standards require a "reasonable" level of security. A documented red/blue team audit before each release is concrete evidence of due diligence — it demonstrates that the development team didn't just hope nothing would go wrong, they systematically checked. For an external auditor, this section IS the documentation of that diligence.

Audit entries below are in reverse-chronological order (most recent first). Each entry lists the findings discovered during that release's review and what was done about them.

v1.20.1

Audited 2026-05-18 · scope: post-feature red/blue team review of the v1.20.0 fleet-integration surface

This is a dedicated security release that follows the team's standing practice: every feature ships through a fresh red/blue team review before tagging. The v1.20.0 release introduced the fleet-audit-by-URL endpoint; this review examined that new surface plus the related existing endpoints, recorded seven findings, fixed six of them before this release was tagged, and verified the seventh as already clean. The purpose of this entry is to document those findings so an auditor can see (a) what was looked at, (b) what was discovered, (c) what was done about it, and (d) how the team's iterative-review pattern works.

Findings & what was done

P1 Fixed — A DNS-based trick could have let an attacker reach the server's own internal network through our URL audit endpoint.
What was wrong: when someone submitted a URL for audit, the tool checked whether the hostname matched the allowlist of approved ICJIA domains before fetching it. If an attacker could control DNS for any subdomain of an approved domain — for example, by compromising a partner agency that operates a subdomain — they could point that hostname at the server's loopback address (127.0.0.1) and trick us into fetching our own internal services on their behalf.
How it was fixed: the tool now resolves the hostname's IP address itself, before fetching, and refuses to connect to any IP in private, loopback, link-local, or multicast ranges. The check repeats on every redirect hop so a redirector planted on an approved host can't chain us into a private address either. The fix covers both IPv4 and IPv6.
P1 Fixed — Redirects from approved hosts to private addresses were silently followed.
What was wrong: when the URL audit endpoint encountered an HTTP redirect, it followed the chain up to 20 hops without re-checking each hop against the allowlist. An attacker who could place content on an approved host could redirect us through to an internal address.
How it was fixed: redirects are now handled manually with the full allowlist and DNS-IP check on every hop, capped at three redirects total.
P1 Fixed — The bulk-inventory endpoint had no allowlist check at all.
What was wrong: caught during the security review while migrating the other URL-fetch endpoints. The bulk-inventory endpoint accepts a list of PDF URLs and fetches each one. It had its own private fetcher with no allowlist — an authorized user could submit a list containing internal addresses and the tool would fetch them. Latent since the endpoint shipped, not previously discovered.
How it was fixed: the bulk endpoint now uses the same allowlist-plus-private-IP-block plumbing as the other URL endpoints.
P2 Fixed — In no-login deployments, one user could unlock remediation for content audited by a different user.
What was wrong: when the tool is run without requiring login, every user is treated as the same "anonymous" identity. The new audit-before-remediation check (added in this release — see "Added" below) would have matched any anonymous user's audit against any other anonymous user's remediation attempt.
How it was fixed: in no-login mode, the identity now includes the user's IP address. The production deployment requires login, so this issue never affected real users.
P2 Fixed — The audit-history table grew without limit.
What was wrong: the canonical audit-history table had no retention policy. An attacker repeatedly auditing unique files could slowly fill the database.
How it was fixed: records older than 365 days are now purged by the periodic cleanup sweep, matching the share-link retention window.
P2 Fixed — A narrow race window let two simultaneous remediation requests both pass the daily limit.
What was wrong: the daily-limit check and the actual job-creation were two separate steps. Two perfectly-simultaneous requests at the cap boundary could both see "you're under the limit" and both proceed.
How it was fixed: the limit check is now repeated as part of the same atomic database transaction that creates the job, so the cap can no longer be exceeded by even one.
P3 Verified clean — Browser cookie security flags.
What was checked: the login session cookie is set with the protective flags (HttpOnly, Secure, SameSite-Strict) that prevent it from being read by client-side scripts, transmitted over plain HTTP, or sent with cross-site requests.
Result: all three flags are correctly set in production. No change needed; recorded in this audit trail for completeness.
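The private-address refusal described in the first P1 finding above can be illustrated with Node's built-in `net.BlockList`. This is a sketch of the general technique, not the tool's code: the function name `isBlockedAddress` and the exact set of ranges are assumptions, and the DNS resolution step that produces the IP address is left out.

```typescript
import { BlockList, isIP } from "node:net";

// Assumed range list: loopback, private, link-local, and multicast.
const blocked = new BlockList();
blocked.addSubnet("0.0.0.0", 8, "ipv4"); // "this network"
blocked.addSubnet("127.0.0.0", 8, "ipv4"); // loopback
blocked.addSubnet("10.0.0.0", 8, "ipv4"); // private
blocked.addSubnet("172.16.0.0", 12, "ipv4"); // private
blocked.addSubnet("192.168.0.0", 16, "ipv4"); // private
blocked.addSubnet("169.254.0.0", 16, "ipv4"); // link-local
blocked.addSubnet("224.0.0.0", 4, "ipv4"); // multicast
blocked.addAddress("::1", "ipv6"); // loopback
blocked.addSubnet("fc00::", 7, "ipv6"); // unique local
blocked.addSubnet("fe80::", 10, "ipv6"); // link-local
blocked.addSubnet("ff00::", 8, "ipv6"); // multicast

// Hypothetical check applied to each resolved address before connecting.
function isBlockedAddress(ip: string): boolean {
  const family = isIP(ip); // 4, 6, or 0 when the string is not an IP literal
  if (family === 0) return true; // refuse anything that is not a plain IP
  return blocked.check(ip, family === 6 ? "ipv6" : "ipv4");
}
```

In a fetcher, this check would run on the address returned by the resolver for the initial URL and again for every redirect hop, and the connection would be made to the checked address so that a second DNS lookup cannot return something different.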

Also added in this release — driven by the same security thinking

Audit required before remediation. Every request to remediate a PDF must be preceded by an audit of the same content within the previous 60 minutes. Any audit path counts — direct upload, URL audit, or fleet bulk. This prevents automated abuse where someone bypasses the audit pipeline and floods the remediation worker directly.
Daily remediation cap. Up to 100 remediations per caller per 24 hours. Sized so a normal agency workflow (~50 PDFs in a busy day) is unaffected, but a flood of thousands is blocked.
Unified audit record. Every audit endpoint now writes a row to the canonical audit-history table with the content fingerprint (SHA-256 hash of the file's bytes). Required so the audit-before-remediation gate works uniformly across all audit paths. The hash is just a fingerprint — it doesn't expose the PDF's contents and can't be reversed back into the document.
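The two gates above (audit within the previous 60 minutes, and at most 100 remediations per caller per 24 hours) can be sketched as a single pure check. The function name `mayRemediate` and the record shapes are hypothetical; a production version would run inside the same atomic database transaction that creates the job, as the race-condition finding above describes.

```typescript
type AuditRecord = { caller: string; contentHash: string; at: number };
type RemediationRecord = { caller: string; at: number };

const AUDIT_WINDOW_MS = 60 * 60 * 1000; // audit must be within the previous 60 minutes
const DAY_MS = 24 * 60 * 60 * 1000;
const DAILY_CAP = 100; // remediations per caller per 24 hours

type GateResult = { allowed: true } | { allowed: false; reason: string };

// Hypothetical gate: all timestamps are milliseconds since the Unix epoch.
function mayRemediate(
  caller: string,
  contentHash: string,
  audits: AuditRecord[],
  remediations: RemediationRecord[],
  now: number,
): GateResult {
  const audited = audits.some(
    (a) => a.caller === caller && a.contentHash === contentHash && now - a.at <= AUDIT_WINDOW_MS,
  );
  if (!audited) return { allowed: false, reason: "no recent audit of this content" };

  const usedToday = remediations.filter((r) => r.caller === caller && now - r.at < DAY_MS).length;
  if (usedToday >= DAILY_CAP) return { allowed: false, reason: "daily remediation cap reached" };

  return { allowed: true };
}
```

Matching on both the caller and the content hash is what keeps one user's audit from unlocking remediation for another user's file.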

Methodology — for the auditor record

The team follows a deliberate practice: every feature ships through a fresh red/blue team review before tagging. The review examines the newly-introduced surface from a sophisticated-adversary perspective, looks for attack patterns like DNS rebinding, race conditions, identity collapse, and slow-burn denial-of-service, and either fixes findings in the same release window or documents them for future work. This release (v1.20.1) is the security-followup to v1.20.0, which added the fleet-audit-by-URL feature. The pattern repeats with every feature release — earlier entries in this audit history list the findings from prior reviews.

For a manager reading this page: the intent here is transparency. The tool is built and reviewed iteratively, and this page is the auditor-readable trail of what was reviewed, what was found, what was fixed, and what was deliberately accepted with mitigation. The technical equivalent (with full code references) lives in README.md § Security for engineers and security reviewers who need that level of detail.

v1.20.0

Audited 2026-05-18 · scope: download filename dialog, PDF export, accessibility polish

A feature release with two material auditor-facing changes: remediated PDFs can now be downloaded under the exact original filename (critical for CMS file replacement, where existing links resolve by name), and the audit report can be saved as a PDF using the browser's own print dialog. No new data is collected, retained, or transmitted. The retention policy described elsewhere on this page is unchanged.

Findings & changes

P3 Changed — Remediated PDF download now defaults to the user's exact original filename.
What changed: when a user remediates a PDF and clicks Download, the file is now saved under the same filename they uploaded — including any spaces, unicode, or punctuation. The download dialog presents three options with "Keep original filename" pre-selected and badged Recommended. The other two ("Add a _remediated suffix" or "Use a different filename") are opt-in.
Why: the most common workflow for remediating an agency PDF is to replace the file in the CMS in place — every existing link on the website, in old emails, in shared documents, keeps working as long as the filename matches. The previous behavior automatically appended _remediated to the filename, which broke this workflow.
Safeguards: the "use a different filename" path explicitly warns the user that the change will break existing links and requires a second click of the Download button to confirm. There is no path traversal risk — the custom filename is treated only as a display name for the browser's save dialog and is capped, encoded, and forced to .pdf before being sent in the response header. The actual file on disk is always located by job ID, never by user-supplied filename.
P3 Added — Audit reports can now be saved as PDF via the browser's print dialog.
What changed: the audit report page and the shared-report page each gained a "PDF (browser print)" button. Clicking it opens the browser's own print dialog, where the user picks "Save as PDF" as the destination. The page applies a print stylesheet that hides interactive controls, switches to black-on-white text, expands collapsed technical sections, and arranges page breaks cleanly.
What this does not change: no new server-side rendering happens — the PDF is created entirely by the user's own browser, on the user's own machine. No PDF content is transmitted to or stored on our server as part of this feature. The chosen filename is whatever the user types in the browser's save dialog and is not visible to us.
P3 Fixed — Accessibility polish on the remediation result page.
What changed: the result page was showing layout shift after content loaded (a known accessibility annoyance for users on slow connections or with reduced-motion preferences), and result sections were appearing partway through the progress animation rather than after it. Both fixed.
Visible improvement: Lighthouse performance score on the result page rose from 84 to 96 on desktop. No retention or privacy implications.
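The "capped, encoded, and forced to .pdf" safeguard in the first finding above can be illustrated with a short sketch. This is not the tool's actual code: the function name `contentDispositionFor`, the 200-character cap, and the `download` fallback name are assumptions made for the example. Only the overall approach (treat the name as a display label, never as a path) comes from the finding.

```typescript
// Hypothetical cap; the policy does not state the real limit.
const MAX_BASENAME_CHARS = 200;

// Builds a Content-Disposition header value from a user-supplied display name.
// The name is never used to locate a file; it only labels the browser's save dialog.
export function contentDispositionFor(requestedName: string): string {
  // Keep only the last path segment so "../" or "C:\" prefixes never survive.
  const lastSegment = requestedName.split(/[\\/]/).pop() ?? "";
  // Drop control characters, then drop an existing .pdf extension (any case).
  const base = lastSegment
    .replace(/[\u0000-\u001f\u007f]/g, "")
    .replace(/\.pdf$/i, "")
    .trim();
  // Cap by code points so a surrogate pair is never cut in half.
  const capped = Array.from(base).slice(0, MAX_BASENAME_CHARS).join("");
  // Force the .pdf extension; fall back to a generic name if nothing is left.
  const finalName = `${capped || "download"}.pdf`;
  // ASCII-only fallback for clients that ignore the filename* parameter.
  const asciiFallback = finalName
    .replace(/[^\x20-\x7e]/g, "_")
    .replace(/["\\]/g, "_");
  // Percent-encode as UTF-8; encodeURIComponent throws on malformed input,
  // in which case the already-safe ASCII fallback is encoded instead.
  let encoded: string;
  try {
    encoded = encodeURIComponent(finalName);
  } catch {
    encoded = encodeURIComponent(asciiFallback);
  }
  // encodeURIComponent leaves ' ( ) * unescaped; escape them for the header.
  encoded = encoded.replace(
    /['()*]/g,
    (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase(),
  );
  return `attachment; filename="${asciiFallback}"; filename*=UTF-8''${encoded}`;
}
```

Because the sketch only ever produces a header string, a hostile name such as `../../etc/passwd` ends up as the harmless label `passwd.pdf`; the file served is still the one the job ID points to.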

Operational improvements

New AGENTS.md at the repository root documents the load-bearing conventions for AI coding agents (Claude Code, Codex, Cursor, etc.) so engineers using those tools to extend the code base get oriented in one read. Not user-facing; reduces the chance of a misconfigured agent committing the wrong thing.
The "Technical Details" expandable on the main results page now includes the same four pipeline diagrams already on the standalone Technical Details page.

v1.19.0

Audited 2026-05-18 · scope: fleet integration + accessibility polish + retention-policy change

This release adds the fleet inventory integration (one HTTP call per PDF returns strict + practical grades plus a year-long shareable report link), expands the URL allowlist to cover all *.illinois.gov state-agency subdomains, bumps the shared-report retention window from 15 days to 365 days, and fixes seven accessibility rule violations across the public policy + technical-details pages. The most material policy change for an auditor reading this page is the retention bump — see the first finding below.

Findings & changes

P2 Accepted — Shared-report retention window extended from 15 days to 365 days.
What changed: when someone creates a shareable audit-report link (either from the web UI's "Create Shareable Link" button or via the new fleet audit-by-URL automation), the resulting link now stays valid for one year instead of 15 days. This applies to the metadata record only — no PDF content is stored alongside it. After 365 days the row becomes eligible for the periodic cleanup sweep and the URL stops working.
Why: auditors and managers reviewing fleet-inventory reports (which list every PDF across ICJIA's sites) need report links that survive between quarterly review cycles. A 15-day TTL caused most links to break before the next review even happened.
Storage cost: the row holds scores, category findings, and timestamps — no PDF bytes. A 100-PDF fleet at roughly 50 KB per record grows the database by about 5 MB per year. The tradeoff was evaluated and accepted in favor of usability.
P2 Accepted — URL allowlist expanded so the fleet automation can audit PDFs across the full Illinois state-agency footprint.
What changed: the audit-by-URL endpoint previously accepted only a handful of explicit ICJIA subdomains. It now also accepts: illinois.gov (every state-agency subdomain), icjia.cloud, icjia.app, and ilheals.com (each including all subdomains).
Why: the ICJIA fleet audit lists PDFs across every site the agency operates and every partner agency. The previous narrow allowlist couldn't cover that fleet.
What it doesn't change: all of the existing protections still apply — the server still blocks private / local / loopback addresses (no SSRF into internal networks), still rejects oversized files (100 MB cap), still requires the fetched bytes to begin with the %PDF- header, and still rejects look-alike domains (a URL like illinois.gov.evil.com does not match the allowlist). The threat profile is the same as a person pasting any one of these URLs into the web interface.
P3 Fixed — Seven accessibility rule violations on the public policy and technical-details pages.
What was wrong: a full axe + Lighthouse audit found that the diagram boxes on these pages couldn't be reached via keyboard, that an inline link in this audit history section was distinguishable only by color (a barrier for colorblind readers), and that several scrollable code blocks couldn't be scrolled without a mouse.
How it was fixed: each scrollable region is now keyboard-focusable, the inline link is now underlined, and the diagram boxes' redundant ARIA labels were replaced with proper structural markup. Both pages now report zero axe violations and score 100 / 100 on Lighthouse's accessibility audit.
P3 Fixed — The new fleet endpoint reported the strict score in both the strict and practical slots of its response.
What was wrong: the new /api/audit-url endpoint had a key-name mismatch with the underlying scoring engine — what the engine internally calls "remediation" the user interface labels "practical." The endpoint looked for the wrong name, found nothing, and fell back to the strict score, so the practical column in the fleet output would have shown the strict number instead of the practical one.
How it was fixed: caught in the local smoke-test step before any caller integrated against the endpoint, so no production fleet report ever published the wrong number. The name mapping is now correct (verified against three test PDFs whose strict and practical scores genuinely differ).
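The look-alike-domain and `%PDF-` header protections described in the allowlist finding above can be sketched as follows. This is an illustration under assumptions, not the endpoint's real code: the function names are hypothetical, the domain list contains only the four domains named in this entry (the earlier explicit ICJIA subdomains are omitted), and limiting the scheme to http/https is an assumption. Private-address blocking and the size cap are separate checks that are not shown.

```typescript
// Domains named in the v1.19.0 entry; each also covers its subdomains.
const ALLOWED_DOMAINS = ["illinois.gov", "icjia.cloud", "icjia.app", "ilheals.com"];

// Returns true only when the URL's hostname is an allowed domain or a
// subdomain of one. Matching is done on the parsed hostname, not the raw string.
export function isAllowedAuditUrl(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  // Assumption for this sketch: only web schemes are accepted.
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  // Normalize case and strip a trailing root dot ("illinois.gov.").
  const host = url.hostname.toLowerCase().replace(/\.$/, "");
  // The leading "." in the suffix test is what rejects "evilillinois.gov",
  // and anchoring at the end is what rejects "illinois.gov.evil.com".
  return ALLOWED_DOMAINS.some((d) => host === d || host.endsWith("." + d));
}

// The five bytes of "%PDF-".
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Returns true when the fetched bytes begin with the PDF header.
export function startsWithPdfHeader(bytes: Uint8Array): boolean {
  return (
    bytes.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((b, i) => bytes[i] === b)
  );
}
```

Comparing against the parsed hostname rather than searching the URL text also handles URLs such as `https://illinois.gov@evil.com/a.pdf`, where the allowed name appears only in the user-info part and the real host is `evil.com`.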

v1.18.1

Audited 2026-05-18 · scope: veraPDF integration correctness + remediation result-page UX

A patch release with four operational fixes against the v1.18.0 remediation feature. None of these findings expose private data or change the file-retention guarantees described elsewhere on this page. One finding is security-adjacent: an auditor who consulted the PDF/UA-1 compliance card on the remediation result page would have seen a silently wrong verdict in any deployment running a recent veraPDF version. Note: at the time of the fix, the feature was still disabled in production, so no real audit was shown the wrong verdict.

Findings

P1 Fixed — PDF/UA-1 compliance verdict was always shown as "not compliant," regardless of the actual PDF.
What was wrong: the tool calls a third-party validator (veraPDF) to report whether the remediated PDF technically conforms to the PDF/UA-1 accessibility standard. The newest version of that validator changed the shape of its result data slightly (it now returns a list of profile results rather than a single one). The tool was reading the result in the old shape, so the verdict was always missing, and the missing verdict was treated as "not compliant." Any auditor looking at the compliance card on the result page would have been shown an incorrect technical verdict.
How it was fixed: the tool now handles both the new and old result shapes correctly. Verified against a live install of the latest veraPDF version. No production deployment had this feature enabled yet at the time of the fix, so no real audit was actually shown the wrong verdict.
P2 Fixed — A second veraPDF shape change could have caused a crash inside the validation routine.
What was wrong: in the same shape change that broke the verdict, veraPDF also moved its rule-by-rule detail list. A defensive fallback in the tool would have tried to read the new "count of failed rules" as if it were a list, which would have crashed the validation routine on certain inputs.
How it was fixed: the unsafe fallback was removed and the read order was updated to prefer the new location first. No crashes were observed in production — this was caught during the same review as the P1 above.
P3 Fixed — Failure count under-reported on heavily-non-compliant PDFs.
What was wrong: the tool reported a compliance-failure total based on the top 20 issues it displayed, rather than veraPDF's own total. On a deeply non-compliant PDF the displayed total would have been lower than reality.
How it was fixed: the tool now uses veraPDF's own total when available. Older veraPDF versions still use the "sum the displayed list" fallback.
P3 Fixed — The "Fix steps" links on the remediation result page were dead.
What was wrong: clicking "Fix steps" next to an outstanding issue on the result page did nothing. The link tried to jump to a card that exists on the audit page but not the result page.
How it was fixed: each issue row now opens an inline accordion showing the detailed findings and numbered Adobe Acrobat fix steps right there on the result page — no navigation needed. Same content as the audit-page cards. Not a privacy or security issue, but a real usability problem for an auditor following up on outstanding items.
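The first three fixes above share one pattern: read the validator's report defensively, prefer the newer shape, and fall back only when a value is genuinely absent. The sketch below illustrates that pattern. The field names (`result`, `results`, `compliant`, `failedCount`) are placeholders invented for this example and are not veraPDF's real report schema. Leaving a missing verdict as `undefined` is likewise one possible design, not a description of what the tool does.

```typescript
// Placeholder shapes: illustrative only, NOT veraPDF's actual report schema.
interface ProfileResult {
  compliant?: boolean;
  failedCount?: number;
}
interface Report {
  result?: ProfileResult; // older shape: a single profile result
  results?: ProfileResult[]; // newer shape: a list of profile results
}

interface Verdict {
  compliant: boolean | undefined; // undefined = the validator gave no verdict
  failedTotal: number;
}

// Normalizes either report shape into one verdict.
// displayedIssueCounts holds the per-issue counts of the (possibly truncated)
// list shown in the UI; it is used only when the validator gives no total.
export function readVerdict(report: Report, displayedIssueCounts: number[]): Verdict {
  // Prefer the newer list shape, then fall back to the older single object.
  const profile = Array.isArray(report.results) ? report.results[0] : report.result;
  // Fallback total: sum of what is displayed (may under-count on bad PDFs).
  const fallbackTotal = displayedIssueCounts.reduce((sum, n) => sum + n, 0);
  // Type-check before use so a number is never treated as a list or vice versa.
  const failedTotal =
    typeof profile?.failedCount === "number" ? profile.failedCount : fallbackTotal;
  return { compliant: profile?.compliant, failedTotal };
}
```

Keeping "no verdict" distinct from "not compliant" means a future shape change shows up as a visibly missing value rather than as a wrong answer.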

Operational improvements

The Ubuntu deploy script (rebuild.sh) now auto-detects an installed veraPDF and, when it isn't installed, prints copy-paste install instructions including the persistence command so the path survives a server reboot. Reduces drift between development and production installs.

v1.18.0

Audited 2026-05-18 · scope: PDF auto-remediation feature (entire new surface)

The remediation pipeline was the first major surface added to this tool. The pre-release red/blue team review covered the public API endpoints, the worker, the frontend, the cleanup sweep, the database schema, and the file lifecycle. The 15-row threat-model checklist documented in docs/pdf-remediation-integration-plan.md (§ Security) was the basis of the review.

Findings

P1 Fixed — Memory exhaustion via large output downloads.
What was wrong: the download endpoint loaded the entire remediated PDF (up to 50 MB) into the API process's memory before sending it to the user's browser. Under several simultaneous downloads, this could exceed the API process's 512 MB memory cap and crash it.
How it was fixed: switched to streaming the file in small chunks (createReadStream + stream.pipe(res)). Memory usage is now constant regardless of output size.
P1 Fixed — Race condition allowed concurrent double-download.
What was wrong: the download token was supposed to be single-use, but two near-simultaneous requests with the same token could both pass the validation check and both retrieve the file before either completed. This violated the "single-use" privacy guarantee.
How it was fixed: the job is marked status='expired' before the file is sent, so any concurrent second request immediately sees the expired status and receives a "410 Gone" response.
P2 Mitigated — Auth-bypass when login is not required (dev/internal mode).
What was found: when the tool runs with the "require login" flag turned off (typical for internal development), the per-job email check on the status, download, and receipt endpoints is bypassed. Anyone who knows a job's UUID could read its data.
How it was handled: job UUIDs use 122 bits of cryptographic randomness — guessing one is computationally impractical. Production deployments run with login required, which closes the gap entirely. This limitation is documented in the integration plan as the known posture; it does not affect the production deployment.
P2 Accepted — Legacy scoring data computed but unused.
What was found: the Adobe Acrobat parity score (a 32-rule check) is still calculated on the server even though the user interface no longer displays it. Costs about 50 milliseconds per audit.
How it was handled: intentionally kept for data-shape stability so existing tests and audit-log entries continue to work. May be removed in a future release if the cost ever matters. Not a privacy or security issue — just dead code.
P3 Accepted — Conservative PDF validation rejects borderline files.
What was found: the qpdf --check validator flags some technically-valid PDF outputs as "warnings," which the tool treats as failures.
How it was handled: accepted by design. Better to reject a borderline file (the user is told the remediation didn't work, can try a different path) than to serve a file that might be damaged and contaminate the user's records. Privacy and integrity over feature completion.
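The double-download fix above relies on claiming the job before any bytes are sent. A minimal sketch of that ordering follows, assuming an in-memory map in place of the real SQLite table and assuming token validation has already passed. The names `jobStatus` and `claimDownload` are hypothetical.

```typescript
type JobStatus = "complete" | "expired";

// In-memory stand-in for the jobs table (the real tool uses SQLite).
export const jobStatus = new Map<string, JobStatus>();

// Returns true for exactly one caller per job. The check and the status flip
// run synchronously, with no await between them, so two requests handled by
// the same Node.js process cannot both observe "complete".
export function claimDownload(jobId: string): boolean {
  if (jobStatus.get(jobId) !== "complete") {
    return false; // unknown or already claimed: the caller answers 410 Gone
  }
  jobStatus.set(jobId, "expired"); // claim first, stream the file afterwards
  return true;
}
```

A caller would stream the file only when `claimDownload` returns true. In a SQL-backed store the same idea is usually expressed as a conditional update (set the status to expired only where it is still complete) and a check that exactly one row changed.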

Pre-launch items still open

External penetration test on the remediation surface (planned before public-announce; budget tracked in Phase 4 roadmap).
Full automated test coverage for the remediation pipeline (remediation.test.ts, remediation-privacy.test.ts, remediation-receipt.test.ts). Tracked in Phase 4.
File the upstream OpenDataLoader object-streams bug with reproducer PDFs (the qpdf preprocessing workaround is in place in the meantime).

v1.17.0 and earlier

Pre-formatted-audit era

Security reviews for releases prior to v1.18.0 were not yet captured in this format. Earlier releases focused on the synchronous audit pipeline (added in v1.0) and authentication flow (Personal Access Tokens added in v1.16, analyze-by-URL added in v1.17). Review history for those releases is available via the commit history on GitHub. Going forward — beginning with v1.18.0 — every release will have a corresponding entry in this section before tagging.

11. Right to inspect & verify

Authorized agency staff — including managers, records-retention officers, and accessibility auditors — can inspect the lifecycle of any specific remediation job by querying the SQLite database directly. Sample queries for common compliance questions:

-- All remediations a specific user performed in a date range
SELECT id, input_filename, status, input_score, output_score,
       datetime(created_at/1000, 'unixepoch', 'localtime') AS started,
       datetime(completed_at/1000, 'unixepoch', 'localtime') AS finished
FROM remediation_jobs
WHERE email = ? AND created_at BETWEEN ? AND ?
ORDER BY created_at DESC;

-- Full lifecycle of a specific job
SELECT event,
       datetime(occurred_at/1000, 'unixepoch', 'localtime') AS at,
       details
FROM remediation_events
WHERE job_id = ?
ORDER BY occurred_at;

-- Sentinel: any job whose output was retained past the 30-minute TTL
SELECT j.id, j.input_filename,
       (e.max_at - j.completed_at) / 60000 AS extra_minutes_on_disk
FROM remediation_jobs j
JOIN (
  SELECT job_id, MAX(occurred_at) AS max_at
  FROM remediation_events
  WHERE event IN ('output_deleted', 'verified_absent')
  GROUP BY job_id
) e ON e.job_id = j.id
WHERE j.status IN ('expired', 'complete')
  AND (e.max_at - j.completed_at) > 30 * 60 * 1000;
-- This query should return ZERO ROWS for a properly-functioning system.

-- Sentinel: any deletion that wasn't verified absent
SELECT job_id, occurred_at
FROM remediation_events
WHERE event = 'output_deleted'
  AND NOT EXISTS (
    SELECT 1
    FROM remediation_events e2
    WHERE e2.job_id = remediation_events.job_id
      AND e2.event = 'verified_absent'
      AND e2.occurred_at >= remediation_events.occurred_at
  );
-- This query should ALSO return ZERO ROWS.

A Phase 3 roadmap item adds a manager-facing verification endpoint that accepts a filename or a file's SHA-256 hash and reports whether the file was ever audited or remediated, with full timestamps. The underlying content_hash column has been populated on every audit and remediation since v1.18.0 in preparation for that feature. Until that endpoint ships, equivalent information is available via direct database query as shown above.

A user can also see their own complete remediation receipt by visiting the result page for any of their jobs (URL pattern: https://audit.icjia.app/remediate/<jobId>). The receipt shows every lifecycle event with human-readable labels, including the verified-deletion event.

12. Standards & compliance alignment

The tool's design and this policy aim to align with the following standards and regulations. Alignment with a standard does not constitute certification — official conformance audits remain the responsibility of the user agency.

WCAG 2.1 Level AA (Web Content Accessibility Guidelines, W3C) — the audit scores PDFs against the nine categories that map to WCAG 2.1 AA success criteria for non-web documents.
ADA Title II (U.S. federal law, effective April 2026 for state and local government digital content) — informs the tool's diagnostic and remediation framing.
Illinois IITAA (Information Technology Accessibility Act) — the tool's compliance disclaimers on the remediation result page link to the Illinois DOIT accessibility standards.
PDF/UA-1 (ISO 14289-1) — the remediation pipeline uses veraPDF to validate output against PDF/UA-1 technical conformance. veraPDF's verdict is surfaced honestly on the result page; manual review is acknowledged as still required for full accessibility.
State of Illinois records-retention policy — the default 7-year retention period for the remediation_events audit trail matches typical state-agency records-retention schedules. Adjust via configuration if your agency's schedule differs.

13. Glossary of technical terms

Append-only audit log: A database table whose rows are added but never modified or deleted by application code. Rows are removed only by an explicit retention-policy purge after a configured age. Append-only design ensures the audit trail is tamper-evident from inside the running system.
ENOENT (Error NO ENTry, reported as "no such file or directory"): The error code returned by the operating system when a program asks for the status of a file that doesn't exist. The remediation worker calls fs.stat() after a delete and expects ENOENT; any other response is not proof that the file is gone and is treated as a compliance anomaly.
fs.stat(): A Node.js function that asks the operating system whether a file exists and, if so, returns its size, permissions, and timestamps. We use it specifically to confirm that a file has been deleted (we expect a "no such file" response).
PDF/UA-1: ISO 14289-1: the technical specification for "accessible PDF." Defines the structural requirements (tags, language declaration, metadata) a PDF must meet to be considered conformant. Validated by veraPDF.
Remediation: The process of taking an existing PDF and adding accessibility structure to it after the fact. Distinguished from accessible authoring, which produces a tagged PDF directly from a source document.
SHA-256 hash: A cryptographic function that turns any input into a fixed-length (64-character) hexadecimal string. The hash is one-way: you can compute the hash from the input, but not the input from the hash. We use it for two purposes here: (1) as a content fingerprint to identify whether two files are the same without storing the files themselves; (2) as a token comparison mechanism that resists timing attacks.
Structure tree / tagged PDF: An optional second layer inside a PDF that describes the semantic role of each piece of content (heading, paragraph, figure, table cell). A PDF with this layer populated is called "tagged" and is readable by screen readers; one without it is "untagged" and is inaccessible. See the Technical Details dropdown on the audit page for a full primer.
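UUIDv4: A version 4 universally unique identifier: a 36-character string (32 hexadecimal digits plus four hyphens) that encodes 128 bits, of which 6 are fixed version and variant markers and the remaining 122 are random. We use UUIDv4s as job IDs so that a collision between two remediation jobs is practically impossible, and so that an attacker cannot guess a valid job ID by enumeration.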
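The ENOENT and fs.stat() entries above describe a check that succeeds only on one specific error. A minimal sketch of that check follows. The function name `isVerifiedAbsent` and the injectable `stat` parameter are assumptions for illustration, and the synchronous `statSync` variant is used here only to keep the example short.

```typescript
import { statSync } from "node:fs";

// Any function that stats a path and throws on failure, as fs.statSync does.
type StatFn = (path: string) => unknown;

// Returns true only when stat fails with ENOENT, i.e. the operating system
// positively reports that nothing exists at the path.
export function isVerifiedAbsent(
  path: string,
  stat: StatFn = (p) => statSync(p),
): boolean {
  try {
    stat(path);
    return false; // stat succeeded: the file is still on disk
  } catch (err) {
    // Other errors (EACCES, EIO, ...) prove nothing about absence,
    // so they are reported as "not verified" rather than as success.
    return (err as { code?: string } | null)?.code === "ENOENT";
  }
}
```

Making the stat function injectable lets the three outcomes (file present, ENOENT, some other error) be exercised in a test without touching the real file system.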

14. Change log for this policy

v1.0 · 2026-05-18 — Initial publication. Covers tool versions v1.18.0 and newer. Documents the audit pipeline and the optional auto-remediation pipeline introduced in v1.18.0.

This policy is version-controlled with the source code. Any change to the data-handling behavior of the tool is reflected here, with a corresponding version bump and a dated entry above. The complete change history is available via git log apps/web/app/pages/data-retention.vue on the project's GitHub repository.

15. Contact & questions

For questions about this policy, requests for technical details beyond what's documented here, requests to inspect a specific job's audit trail, or any concern about how this tool handles data:

Innovation and Digital Services (IDS)

Illinois Criminal Justice Information Authority

cja.info@illinois.gov

Source code and issue tracker: github.com/ICJIA/file-accessibility-audit.
