EUR-Lex CELLAR API: The Developer's Guide to EU Legal Data

Everything you need to query the EU Publications Office CELLAR SPARQL endpoint, decode ELI identifiers, and build regulatory data pipelines. Includes working code examples, rate-limit guidance, and architecture patterns.

What Is CELLAR?

CELLAR is the semantic repository operated by the Publications Office of the European Union (OP). It stores the metadata and content of every document published on EUR-Lex (treaties, regulations, directives, decisions, court judgments, and preparatory acts) as RDF triples. The dataset currently holds over 2.7 million work entries, each identified by a unique URI, and exposes them through two primary interfaces: a SPARQL endpoint for structured metadata queries and a REST API for content retrieval.

For developers building compliance tools, regulatory monitoring systems, or legal research platforms, CELLAR is the canonical upstream source for EU legal data. Unlike screen-scraping EUR-Lex HTML pages, the CELLAR API returns machine-readable RDF using the CDM (Common Data Model) ontology — a stable, versioned schema maintained by the Publications Office. This means your integrations won't break every time EUR-Lex redesigns its frontend.

The SPARQL Endpoint

The primary interface for querying CELLAR metadata is the public SPARQL endpoint:

https://publications.europa.eu/webapi/rdf/sparql

This endpoint accepts SPARQL 1.1 queries via HTTP GET or POST. For GET requests, pass the URL-encoded query in the query parameter. For POST, send the same parameter in an application/x-www-form-urlencoded body. The endpoint supports standard content negotiation: set the Accept header to application/sparql-results+json for JSON output, application/sparql-results+xml for XML, or text/csv for CSV.
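
The POST variant described above can be sketched with the Python standard library alone. This is a minimal sketch rather than an official client: the function names, the timeout value, and the User-Agent string are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://publications.europa.eu/webapi/rdf/sparql"


def build_sparql_request(
    query: str, accept: str = "application/sparql-results+json"
) -> urllib.request.Request:
    """Build a POST request carrying the query as a form-encoded body."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        SPARQL_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Accept": accept,
            "Content-Type": "application/x-www-form-urlencoded",
            # Hypothetical identifier; replace it with your own application name.
            "User-Agent": "example-cellar-client/0.1",
        },
    )


def run_query(query: str, timeout: float = 65.0) -> dict:
    """Send the query and decode the JSON result document."""
    request = build_sparql_request(query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping the request builder separate from the network call makes it possible to inspect the headers and body in unit tests without touching the endpoint.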

The underlying triple store is an OpenLink Virtuoso instance. Queries time out after 60 seconds, so you should always use LIMIT and OFFSET for pagination, and be precise with your graph patterns to avoid full scans.
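
One way to apply LIMIT and OFFSET is to generate one query per page and stop at the first short page. The sketch below assumes a hypothetical `run` callable that executes a query and returns a list of rows, and it assumes the base query already ends in an ORDER BY clause so that pages do not overlap.

```python
from typing import Callable, Iterator, List


def paged_queries(base_query: str, page_size: int = 1000) -> Iterator[str]:
    """Yield the base query with successive LIMIT/OFFSET windows appended."""
    if page_size < 1:
        raise ValueError("page_size must be positive")
    offset = 0
    while True:
        yield f"{base_query}\nLIMIT {page_size}\nOFFSET {offset}"
        offset += page_size


def fetch_all(
    base_query: str,
    run: Callable[[str], List[dict]],
    page_size: int = 1000,
    max_pages: int = 100,
) -> List[dict]:
    """Collect rows page by page until a page comes back short."""
    rows: List[dict] = []
    for page_number, query in enumerate(paged_queries(base_query, page_size)):
        if page_number >= max_pages:
            break  # safety cap so a bug cannot loop forever
        page = run(query)
        rows.extend(page)
        if len(page) < page_size:
            break  # last page reached
    return rows
```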

Understanding ELI Identifiers

The European Legislation Identifier (ELI) is a URI scheme adopted by EU member states and institutions to provide persistent, machine-readable references for legal documents. An ELI looks like this:

http://data.europa.eu/eli/reg/2024/1689/oj

That identifier points to Regulation (EU) 2024/1689 — the EU AI Act — as published in the Official Journal. The path segments encode the document type (reg for regulation, dir for directive, dec for decision), the year, the document number, and the publication context.
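
The path segments can be pulled apart with a few lines of Python. This sketch only handles the type/year/number/context shape shown above; the `Eli` tuple and `parse_eli` function are illustrative names, and other ELI shapes are not covered.

```python
from typing import NamedTuple
from urllib.parse import urlparse

# Type codes named in the text above; other codes are passed through as-is.
ELI_TYPES = {"reg": "regulation", "dir": "directive", "dec": "decision"}


class Eli(NamedTuple):
    doc_type: str
    year: int
    number: str
    context: str


def parse_eli(uri: str) -> Eli:
    """Split an ELI of the form .../eli/{type}/{year}/{number}/{context}."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) < 5 or parts[0] != "eli":
        raise ValueError(f"not a recognised ELI: {uri!r}")
    _, doc_type, year, number, context = parts[:5]
    return Eli(doc_type, int(year), number, context)


eli = parse_eli("http://data.europa.eu/eli/reg/2024/1689/oj")
print(ELI_TYPES.get(eli.doc_type, "unknown"), eli.year, eli.number, eli.context)
```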

ELI identifiers are hierarchical. A single work can have multiple expressions (language versions) and multiple manifestations (format versions: PDF, HTML, XHTML). In CELLAR, the CDM ontology links works, expressions, and manifestations using FRBR-aligned relationships: cdm:work_has_expression, cdm:expression_belongs_to_work, and cdm:expression_has_manifestation.

CELEX Numbers Explained

Before ELI, the EU used CELEX numbers as the primary document identifiers. They are still widely used in practice and supported throughout CELLAR. A CELEX number has the format:

[Sector][Year][Document Type][Number]
Example: 32024R1689

3    = Sector 3 (Secondary legislation)
2024 = Year of adoption
R    = Regulation
1689 = Sequential number

The sector digit tells you the legal domain: 1 for treaties, 2 for international agreements, 3 for secondary legislation (regulations, directives, decisions), 4 for complementary legislation, 5 for preparatory acts, 6 for case-law, and so on. In CELLAR, every document has a cdm:resource_legal_id_celex property you can query directly.
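
The format above maps naturally onto a regular expression. The sketch below is an assumption-laden simplification: it accepts a one-digit sector and a one- or two-letter type code, and it does not handle CELEX numbers that carry suffixes (for example corrigenda or consolidated versions).

```python
import re
from typing import NamedTuple

# Sector labels taken from the list above; unlisted sectors map to "unknown".
SECTORS = {
    "1": "treaties",
    "2": "international agreements",
    "3": "secondary legislation",
    "4": "complementary legislation",
    "5": "preparatory acts",
    "6": "case-law",
}

# Assumption: one sector digit, four-digit year, 1-2 letter type, numeric tail.
CELEX_PATTERN = re.compile(
    r"^(?P<sector>\d)(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})(?P<number>\d+)$"
)


class Celex(NamedTuple):
    sector: str
    year: int
    doc_type: str
    number: str


def parse_celex(value: str) -> Celex:
    """Split a plain CELEX number into its four components."""
    match = CELEX_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported CELEX format: {value!r}")
    return Celex(
        match["sector"], int(match["year"]), match["doc_type"], match["number"]
    )


celex = parse_celex("32024R1689")
print(SECTORS.get(celex.sector, "unknown"), celex.year, celex.doc_type, celex.number)
```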

Example SPARQL Queries

Below are three practical queries you can run directly against the CELLAR endpoint. All use the CDM namespace prefix PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>.

Query Recent EU Regulations

This query returns the 20 EU regulations with the most recent date of document, together with their CELEX number and English title:

PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?celex ?title ?date
WHERE {
  ?work cdm:work_has_resource-type
        <http://publications.europa.eu/resource/authority/resource-type/REG> .
  ?work cdm:resource_legal_id_celex ?celex .
  ?work cdm:work_date_document ?date .
  ?work cdm:work_has_expression ?expr .
  ?expr cdm:expression_uses_language
        <http://publications.europa.eu/resource/authority/language/ENG> .
  ?expr cdm:expression_title ?title .
}
ORDER BY DESC(?date)
LIMIT 20
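
When the query above is requested as application/sparql-results+json, each row arrives as a dictionary of typed bindings. A small helper can flatten that into plain dictionaries. The function name is an assumption, and the sample payload in the usage lines is illustrative rather than real endpoint output.

```python
from typing import Dict, List


def flatten_bindings(result: dict) -> List[Dict[str, str]]:
    """Turn SPARQL JSON results into a list of {variable: value} rows."""
    variables = result["head"]["vars"]
    rows = []
    for binding in result["results"]["bindings"]:
        # Variables left unbound by OPTIONAL patterns are simply absent.
        rows.append(
            {name: binding[name]["value"] for name in variables if name in binding}
        )
    return rows


# Illustrative payload in the standard SPARQL JSON results layout.
sample = {
    "head": {"vars": ["celex", "title", "date"]},
    "results": {
        "bindings": [
            {
                "celex": {"type": "literal", "value": "32024R1689"},
                "title": {"type": "literal", "value": "Example title"},
            }
        ]
    },
}
print(flatten_bindings(sample))
```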

Query by Subject Matter

To find up to 50 directives classified under the EuroVoc descriptor for "artificial intelligence" (concept ID 231611):

PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>

SELECT DISTINCT ?celex ?title
WHERE {
  ?work cdm:work_has_resource-type
        <http://publications.europa.eu/resource/authority/resource-type/DIR> .
  ?work cdm:resource_legal_id_celex ?celex .
  ?work cdm:work_is_about_concept_eurovoc
        <http://eurovoc.europa.eu/231611> .
  ?work cdm:work_has_expression ?expr .
  ?expr cdm:expression_uses_language
        <http://publications.europa.eu/resource/authority/language/ENG> .
  ?expr cdm:expression_title ?title .
}
LIMIT 50
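
If you run this query for several document types or descriptors, it helps to generate it from parameters. The builder below is a sketch: the function name and the validation rules are assumptions, chosen so that untrusted input cannot change the shape of the query.

```python
import re

CDM = "http://publications.europa.eu/ontology/cdm#"
AUTHORITY = "http://publications.europa.eu/resource/authority"
EUROVOC = "http://eurovoc.europa.eu"


def build_subject_query(
    resource_type: str, eurovoc_id: str, language: str = "ENG", limit: int = 50
) -> str:
    """Return the subject-matter query with the three URIs filled in."""
    # Accept only plain codes so that input cannot inject SPARQL syntax.
    if not re.fullmatch(r"[A-Z_]+", resource_type):
        raise ValueError(f"unexpected resource type: {resource_type!r}")
    if not re.fullmatch(r"[0-9]+", eurovoc_id):
        raise ValueError(f"unexpected EuroVoc id: {eurovoc_id!r}")
    if not re.fullmatch(r"[A-Z]{3}", language):
        raise ValueError(f"unexpected language code: {language!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    lines = [
        f"PREFIX cdm: <{CDM}>",
        "SELECT DISTINCT ?celex ?title",
        "WHERE {",
        f"  ?work cdm:work_has_resource-type <{AUTHORITY}/resource-type/{resource_type}> .",
        "  ?work cdm:resource_legal_id_celex ?celex .",
        f"  ?work cdm:work_is_about_concept_eurovoc <{EUROVOC}/{eurovoc_id}> .",
        "  ?work cdm:work_has_expression ?expr .",
        f"  ?expr cdm:expression_uses_language <{AUTHORITY}/language/{language}> .",
        "  ?expr cdm:expression_title ?title .",
        "}",
        f"LIMIT {limit}",
    ]
    return "\n".join(lines)


print(build_subject_query("DIR", "231611"))
```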

Get Document Metadata

Retrieve comprehensive metadata for a specific regulation by CELEX number — including its ELI, date of entry into force, subject areas, and available language versions:

PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>

SELECT ?eli ?dateForce ?dateDocument ?subject ?lang
WHERE {
  ?work cdm:resource_legal_id_celex "32024R1689" .
  OPTIONAL { ?work cdm:resource_legal_eli ?eli . }
  OPTIONAL { ?work cdm:work_date_document ?dateDocument . }
  OPTIONAL {
    ?work cdm:resource_legal_date_entry-into-force ?dateForce .
  }
  OPTIONAL {
    ?work cdm:work_is_about_concept_eurovoc ?subject .
  }
  OPTIONAL {
    ?work cdm:work_has_expression ?expr .
    ?expr cdm:expression_uses_language ?lang .
  }
}
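
Because the subject and language patterns are independent OPTIONAL blocks, the query above returns one row per subject and language combination. A sketch of folding those rows back into one record follows. It assumes rows have already been flattened to plain {variable: value} dictionaries; the function name and the sample values are illustrative.

```python
from typing import Dict, List


def collapse_metadata(rows: List[Dict[str, str]]) -> dict:
    """Merge repeated rows for one work into a single metadata record."""
    record = {
        "eli": None,
        "dateForce": None,
        "dateDocument": None,
        "subjects": set(),
        "languages": set(),
    }
    for row in rows:
        # Single-valued fields: keep the first value seen.
        for key in ("eli", "dateForce", "dateDocument"):
            if record[key] is None and key in row:
                record[key] = row[key]
        # Multi-valued fields: accumulate distinct values.
        if "subject" in row:
            record["subjects"].add(row["subject"])
        if "lang" in row:
            record["languages"].add(row["lang"])
    return record


# Illustrative rows: two subjects crossed with two languages.
rows = [
    {"eli": "http://data.europa.eu/eli/reg/2024/1689/oj", "subject": s, "lang": l}
    for s in ("urn:example:subject-a", "urn:example:subject-b")
    for l in ("urn:example:lang-en", "urn:example:lang-fr")
]
print(collapse_metadata(rows))
```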

REST API for Content Retrieval

While SPARQL handles metadata queries, the CELLAR REST API is what you use to retrieve actual document content — full-text HTML, PDF, or Formex XML manifestations. The base URL follows the pattern:

GET https://publications.europa.eu/resource/cellar/{cellar-id}
Accept: application/xhtml+xml, application/pdf, text/html

You obtain the cellar-id from your SPARQL results — it is the UUID portion of the manifestation URI. You can also use content negotiation with ELI URIs directly. Sending a GET to http://data.europa.eu/eli/reg/2024/1689/oj with an Accept: application/xhtml+xml header will redirect to the XHTML manifestation in CELLAR.
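
Both steps can be sketched in Python: pulling the UUID out of a manifestation URI and building a content request with an Accept header. The function names are assumptions, and the manifestation URI used in the example is a made-up placeholder, not a real resource.

```python
import re
import urllib.request
from typing import Optional

UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)


def extract_cellar_id(manifestation_uri: str) -> Optional[str]:
    """Return the UUID portion of a manifestation URI, or None if absent."""
    match = UUID_PATTERN.search(manifestation_uri)
    return match.group(0).lower() if match else None


def build_content_request(
    resource_uri: str, accept: str = "application/xhtml+xml"
) -> urllib.request.Request:
    """Build a GET request that asks for one specific manifestation format."""
    return urllib.request.Request(resource_uri, headers={"Accept": accept})


def fetch_content(
    resource_uri: str, accept: str = "application/xhtml+xml", timeout: float = 30.0
) -> bytes:
    """Fetch the document body; urllib follows HTTP redirects by default."""
    request = build_content_request(resource_uri, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Hypothetical URI, used only to show the extraction.
example_uri = (
    "http://publications.europa.eu/resource/cellar/"
    "01234567-89ab-cdef-0123-456789abcdef.0006.03/DOC_1"
)
print(extract_cellar_id(example_uri))
```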

For programmatic access, the XHTML manifestations are the most useful. They contain structured, semantic markup with article-level elements identified by eId attributes from the Akoma Ntoso naming convention. This structure makes it possible to extract individual articles, recitals, or annexes without parsing unstructured HTML.
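
As a rough illustration of article-level extraction, the sketch below indexes every element that carries an eId attribute. It assumes the manifestation parses as well-formed XML; the sample markup and the eId values are invented for the example and may not match what CELLAR actually serves.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def index_by_eid(xhtml: str) -> Dict[str, str]:
    """Map each eId attribute value to the whitespace-normalised text it contains."""
    root = ET.fromstring(xhtml)
    index: Dict[str, str] = {}
    for element in root.iter():
        eid = element.get("eId")
        if eid:
            text = " ".join(element.itertext())
            index[eid] = " ".join(text.split())
    return index


# Invented sample markup, for illustration only.
SAMPLE_XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div eId="art_1"><p>Article 1</p><p>Subject matter</p></div>'
    '<div eId="art_2"><p>Article 2</p><p>Scope</p></div>'
    "</body></html>"
)

for eid, text in index_by_eid(SAMPLE_XHTML).items():
    print(eid, "->", text)
```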

The Pillar IV Atom Feed

For near-real-time monitoring of new and updated documents, the Publications Office provides the Pillar IV notification service via Atom feeds. The feed URL is:

https://publications.europa.eu/webapi/notification/

Each feed entry contains the CELLAR URI of the affected resource, the type of change (creation or modification), and a timestamp. You can filter the feed by document type, date range, and other parameters. The feed typically updates within a few hours of a document being published on EUR-Lex.

In practice, most regulatory monitoring systems poll this feed every 15–30 minutes, diff the results against a local database, and trigger downstream processing for new entries. This is significantly more reliable than scraping the EUR-Lex website and avoids being blocked by rate limits on the HTML frontend.
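
Parsing the feed needs nothing beyond the standard library. The sketch below assumes the standard Atom layout, with the resource URI in each entry's id element and the timestamp in updated; where the real feed places the change type and URI should be checked against an actual response.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_feed(xml_text: str) -> List[Dict[str, str]]:
    """Extract URI, timestamp, and title from each Atom entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entries.append(
            {
                "uri": entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip(),
                "updated": entry.findtext(
                    "atom:updated", default="", namespaces=ATOM_NS
                ).strip(),
                "title": entry.findtext(
                    "atom:title", default="", namespaces=ATOM_NS
                ).strip(),
            }
        )
    return entries


# Invented feed fragment with a placeholder URI, for illustration only.
SAMPLE_FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry>"
    "<id>http://publications.europa.eu/resource/cellar/"
    "01234567-89ab-cdef-0123-456789abcdef</id>"
    "<updated>2024-07-12T10:00:00Z</updated>"
    "<title>modification</title>"
    "</entry>"
    "</feed>"
)
print(parse_feed(SAMPLE_FEED))
```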

Rate Limits and Best Practices

The CELLAR SPARQL endpoint is a public service with no authentication requirement, but there are practical limits you need to respect:

  • Query timeout: 60 seconds. Complex queries with unbound variables or missing filters will be terminated. Always use LIMIT and constrain your graph patterns.
  • Concurrent connections: The endpoint throttles by IP. Keep concurrent requests below 5 and implement exponential backoff on HTTP 429 or 503 responses.
  • Payload size: Large result sets (10,000+ rows) should be paginated using OFFSET and LIMIT, with an ORDER BY clause so that page boundaries stay stable between requests. Requesting everything in one query will either time out or produce incomplete results.
  • Caching: Document metadata rarely changes after initial publication. Cache CELEX-to-URI mappings, ELI identifiers, and document metadata locally. Only re-fetch content when the Pillar IV Atom feed signals a modification.
  • User-Agent header: Set a descriptive User-Agent header identifying your application. This helps the OP team understand usage patterns and is considered a courtesy.
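
The backoff advice above can be sketched as a small retry wrapper. The `fetch` callable is a hypothetical adapter that returns a status code and body (urllib itself raises an exception on 429, so you would wrap that yourself), and the delay values are illustrative defaults, not figures published by the Publications Office.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE_STATUSES = {429, 503}


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(), retrying with exponential backoff on 429 and 503."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE_STATUSES:
            if status >= 400:
                raise RuntimeError(f"request failed with HTTP {status}")
            return body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
        delay = min(max_delay, base_delay * 2 ** attempt)
        # Random jitter keeps parallel workers from retrying in lockstep.
        sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function testable without real waiting.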

If you need higher throughput or guaranteed SLAs, the Publications Office offers bulk download packages of the full EUR-Lex dataset in RDF format, updated weekly. These are available from the EU Open Data Portal and can be loaded into your own Virtuoso or Apache Jena instance.

Building a Regulatory Data Pipeline

A production-grade pipeline for consuming EU legal data from CELLAR typically has four stages:

1. Discovery

Poll the Pillar IV Atom feed on a schedule (every 15–30 minutes). Parse each entry, extract the CELLAR URI, and check your local database for duplicates. Queue new or modified URIs for processing.
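
A minimal sketch of the duplicate check, using SQLite as the local database. The table layout and function names are assumptions; entries are expected as dictionaries with "uri" and "updated" keys, as a feed parser might produce.

```python
import sqlite3
from typing import Dict, List


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store of URIs that have already been seen."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (uri TEXT PRIMARY KEY, updated TEXT NOT NULL)"
    )
    return conn


def queue_changes(
    conn: sqlite3.Connection, entries: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return entries that are new or carry a changed timestamp, and record them."""
    queued = []
    for entry in entries:
        row = conn.execute(
            "SELECT updated FROM seen WHERE uri = ?", (entry["uri"],)
        ).fetchone()
        if row is None or row[0] != entry["updated"]:
            queued.append(entry)
            conn.execute(
                "INSERT OR REPLACE INTO seen (uri, updated) VALUES (?, ?)",
                (entry["uri"], entry["updated"]),
            )
    conn.commit()
    return queued
```

In a real pipeline you would likely mark an entry as seen only after downstream processing succeeds, so that failures are retried on the next poll.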

2. Metadata Enrichment

For each new URI, run a targeted SPARQL query to retrieve the CELEX number, ELI, title, date of document, date of entry into force, EuroVoc subject descriptors, and available language expressions. Store this as structured metadata in your database.

3. Content Retrieval

Fetch the XHTML manifestation using the REST API. Parse the Akoma Ntoso structure to extract individual articles, recitals, and annexes. This article-level granularity is critical for compliance workflows — teams need to track changes to specific articles, not entire documents.

4. Diffing and Alerting

When a document is modified (corrigenda, consolidated versions), compare the new article-level content against your stored version. Generate a structured diff showing which articles changed, what was added, and what was removed. Push alerts to downstream consumers — email notifications, Slack webhooks, or dashboard updates.
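
The comparison step can be sketched with difflib from the standard library. The sketch assumes each version has already been parsed into a dictionary mapping an article identifier to its text; the function name and the sample identifiers are illustrative.

```python
import difflib
from typing import Dict


def diff_articles(old: Dict[str, str], new: Dict[str, str]) -> dict:
    """Report which articles were added, removed, or changed between versions."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {}
    for key in sorted(set(old) & set(new)):
        if old[key] != new[key]:
            # Line-based unified diff of the two article texts.
            changed[key] = list(
                difflib.unified_diff(
                    old[key].splitlines(),
                    new[key].splitlines(),
                    fromfile=f"{key} (stored)",
                    tofile=f"{key} (new)",
                    lineterm="",
                )
            )
    return {"added": added, "removed": removed, "changed": changed}


stored = {"art_1": "Subject matter", "art_2": "Scope of application"}
latest = {"art_1": "Subject matter", "art_2": "Scope", "art_3": "Definitions"}
report = diff_articles(stored, latest)
print(report["added"], report["removed"], list(report["changed"]))
```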

This architecture lets you serve near-real-time regulatory intelligence from a reliable public data source without depending on fragile web scraping or expensive commercial data vendors.

How Polzia Uses CELLAR

At Polzia, CELLAR is one of over 200 regulatory sources we monitor across 21 markets. Our pipeline follows the architecture described above, with particular emphasis on article-level diffing. When a regulation is amended or a consolidated version is published, our system parses the Akoma Ntoso XHTML into individual articles and generates a clause-by-clause comparison against the previous version.

This means compliance teams see exactly which articles changed — not just that "Regulation (EU) 2024/1689 was updated." They get a structured diff showing inserted paragraphs, deleted provisions, and modified definitions, with each change classified by severity. Combined with EuroVoc classification and our AI-powered relevance scoring, teams can focus their review time on the changes that actually affect their obligations.

Common Pitfalls

  • Assuming one CELEX per regulation: Consolidated texts, corrigenda, and amendments each have their own CELEX numbers. A single logical regulation may correspond to dozens of CELEX entries. Use the cdm:work_related_to and cdm:consolidated_by relationships to navigate between them.
  • Ignoring language expressions: Many regulations exist in 24 language versions. If you omit the language filter in your SPARQL query, any pattern that joins through expressions returns one row per language, so up to 24 times the expected results. Always filter on cdm:expression_uses_language.
  • Hardcoding URIs: Authority table URIs (resource types, languages, EuroVoc concepts) occasionally change. Fetch the current authority tables from the EU Vocabularies site and cache them locally rather than embedding raw URIs in your queries.
  • Not handling Formex XML: Older documents (pre-2014) may not have XHTML manifestations. They are stored in Formex XML format, which has a different schema. Your parser needs to handle both formats.

Additional Resources

  • CDM Ontology Documentation: The full Common Data Model specification is published by the Publications Office, detailing every class and property available in CELLAR.
  • EU Vocabularies: The authority tables for resource types, languages, EuroVoc descriptors, and other controlled vocabularies used in CELLAR queries.
  • ELI Technical Specification: The Council conclusions on ELI and the technical documentation for implementing European Legislation Identifiers.
  • EUR-Lex SPARQL Webinar: The Publications Office periodically runs free webinars demonstrating CELLAR queries. Recordings are available on their website.

Stop Building Pipes, Start Shipping Insights

Building a reliable CELLAR integration takes weeks of engineering time — handling edge cases, parsing multiple document formats, managing rate limits, and keeping your pipeline running. Polzia does all of this out of the box, plus 200+ additional regulatory sources across 21 markets. Get article-level diffs, AI severity scoring, and compliance workflows without maintaining the infrastructure yourself.