Sigil Data takes raw, publicly licensed UK legal source material (court judgments, tribunal decisions, regulatory enforcement notices, public-inquiry records) and converts it into the structured, expert-labelled training data that modern artificial-intelligence systems require. Every record is processed by a team of UK-qualified legal professionals, working strictly within the borders of the United Kingdom, on infrastructure held under UK jurisdiction. Nothing we touch leaves UK soil.
The United Kingdom is the source of one of the deepest publicly licensed bodies of legal material in the world. It is also the jurisdiction whose artificial-intelligence companies, professional services firms and public sector now operate under hardening procurement requirements for jurisdictionally clean training data. Those requirements are driven by recent rulings on training-data provenance, by the visible failure of foreign-controlled health-data initiatives, and by the maturing UK AI Bill consultation.
The incumbents who could have built a UK-sovereign answer to the AI-ready legal data problem are structurally locked out. The dominant commercial legal databases are foreign-owned and operate under licences that explicitly prohibit the use of their corpora for artificial-intelligence training. The offshore-labour annotation industry cannot make a credible sovereignty claim of any kind. The academic releases are partial, dated and not commercial-grade.
Sigil Data exists to close that gap. We take the raw material the United Kingdom already publishes openly, and we apply to it the only thing that turns raw legal text into training-grade structured data: the considered judgement of qualified legal practitioners, working at scale, under a disciplined apparatus.
Per-corpus structured datasets — Employment Tribunal, Tax Chamber, Property Chamber, Information Commissioner enforcement, financial-regulatory notices, public inquiry records, specialist court decisions. Each Codex is delivered as a Croissant-compatible JSON-Lines package with full case metadata, citation graph, ratio decidendi, reasoning chain, controlled-vocabulary outcome and quantum, and verified anonymisation. Ready to train against.
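As an illustration only, one record in a JSON-Lines package of this kind might be written and read back as in the Python sketch below. Every field name, the outcome vocabulary and the placeholder values are assumptions made for the example, not the published schema, and the accompanying Croissant metadata file is not shown.

```python
import json

# Hypothetical controlled vocabulary for outcomes (an assumption for this sketch).
OUTCOME_VOCAB = {"claim_upheld", "claim_dismissed", "partially_upheld", "withdrawn"}

# Hypothetical record: field names and values are placeholders, not real case data.
record = {
    "case_metadata": {"corpus": "Employment Tribunal", "case_id": "EXAMPLE-0001"},
    "citations": ["EXAMPLE-0000"],  # outgoing edges of the citation graph
    "ratio_decidendi": "Placeholder text for the ratio.",
    "reasoning_chain": ["Placeholder step 1", "Placeholder step 2"],
    "outcome": "claim_upheld",
    "quantum_gbp": 12500,
    "anonymisation_verified": True,
}


def validate(rec):
    """Reject records whose outcome is outside the controlled vocabulary."""
    if rec["outcome"] not in OUTCOME_VOCAB:
        raise ValueError(f"unknown outcome label: {rec['outcome']!r}")
    return rec


def to_jsonl(records):
    """Serialise records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(validate(r), ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(to_jsonl([record]))
```

Keeping each record on a single line means a training pipeline can stream the file without loading it whole, and checking the vocabulary at write time keeps uncontrolled outcome labels out of the package.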
The proprietary framework under which every Codex is produced. The Apparatus was designed deliberately, by bridging the communication gap between senior legal professionals — drawing on a range of experiences across long careers in practice, on the bench and in regulatory work — and data scientists with a deep understanding of how data must be prepared to train modern artificial-intelligence systems. Neither discipline could have built it alone.
The result is a disciplined five-stage pipeline (triage, structural extraction, citation graph, ratio and reasoning chain, sampled senior review) that lets teams of UK-qualified legal professionals apply their legal fluency to raw source material in strict conformity with the framework, without re-inventing it on every record. Measurable inter-annotator agreement, calibration thresholds and a versioned change log keep the standard institutional, not personal. The Apparatus is the firm's principal intellectual asset and the reason our data is fit for AI.
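To make "measurable inter-annotator agreement" concrete, the Python sketch below computes Cohen's kappa for two annotators who labelled the same records, then compares it with a threshold. The text above does not say which agreement statistic or threshold the Apparatus uses, so both the choice of kappa and the 0.8 value are assumptions for illustration.

```python
from collections import Counter

# Hypothetical calibration threshold; the real value is not stated in the text.
CALIBRATION_THRESHOLD = 0.8


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout: agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def meets_threshold(labels_a, labels_b, threshold=CALIBRATION_THRESHOLD):
    """True when agreement between the two annotators reaches the threshold."""
    return cohen_kappa(labels_a, labels_b) >= threshold


if __name__ == "__main__":
    first = ["upheld", "dismissed", "upheld", "dismissed"]
    second = ["upheld", "dismissed", "upheld", "upheld"]
    print(cohen_kappa(first, second), meets_threshold(first, second))
```

Kappa corrects raw agreement for chance, so two annotators who both favour the commonest label do not score well simply by doing so.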
A published benchmark evaluation set against which a customer's UK-legal artificial-intelligence system can be measured. A small, gold-standard reference corpus, hand-labelled by senior practitioners and a retired judicial reviewer, and held as the calibration point for every record we sigil. The reference against which the rest of the UK legal-AI market will be measured.
Every record we publish is drawn from a UK-government-licensed source under the Open Government Licence v3.0 or an equivalent public-sector permission. We do not use BCIS, Westlaw, LexisNexis, or any other foreign-controlled subscription corpus whose licence prohibits artificial-intelligence training use.
Every labelling judgement is made by a UK-qualified legal professional resident in the United Kingdom. Our paralegals, trainee solicitors and senior reviewers all hold relevant UK qualifications and operate from UK addresses. We do not subcontract any element of the labelling work to offshore providers, on principle.
Every dataset is stored, processed and delivered from infrastructure located inside UK jurisdiction. No record is sent to any third-party artificial-intelligence service for processing, including for redaction, classification or summarisation. The source data, the working data and the delivered data all stay on UK soil from intake to handover.
We attribute every record to its public-sector origin under the Open Government Licence v3.0. Where any reporting restriction applies to a source document we honour it in full, and we do not publish material outside the scope the original tribunal or court intended. Our anonymisation pipeline is audited against UK Information Commissioner's Office guidance.
Sigil Data is a venture of CSQS Ltd, a chartered quantity surveying practice in its sixth year, regulated by the Royal Institution of Chartered Surveyors. The firm operates without external equity. Initial discussions with prospective customers, partner law firms and senior legal practitioners are welcomed at the address below.