Can we build an AI policy assistant?

In my previous article, I explored whether GPT-4 could replace a policy expert. I tested the model using a series of prompts to assess its performance on key tasks. It became clear that while GPT-4 can help with summarising legislative texts and translating policy into plain language, the model is too limited to take over the workload fully.

I wondered whether we could go beyond simple prompting to create a more capable assistant, one that uses AI to aid the work of a policy expert. So I sat down with a computer scientist to explore whether it is possible to develop a digital assistant specifically for EU policy experts. We approach this challenge in three steps: defining the tasks, selecting the model architecture, and evaluating the outputs.

1. Defining Tasks

A policy expert navigates legal, technical, and organisational contexts, communicates clearly to diverse audiences, and produces outputs that reflect the structure and complexities of policy practice. In the previous article, we outlined and tested five key steps a policy expert typically follows in their work:

  1. Identifying applicable legislation for specific products, markets, or activities

  2. Interpreting complex legal texts and summarising them in plain language

  3. Translating legal obligations into actionable requirements for procurement, design, or reporting

  4. Drafting internal policy documents, checklists, or technical guidance

  5. Supporting advocacy efforts through consultation responses or stakeholder briefings

We want the assistant to support the expert with tasks such as summarising legislation, extracting legal obligations, or drafting initial guidance.

2. Choosing the Right Architecture

Once we have defined the tasks we want the assistant to perform, we have to decide how to build the system. There are several options for applying LLMs such as GPT-4 or Claude to domain-specific tasks, each with its own strengths and limitations. We will look at five main options, starting from the simplest and moving toward more advanced setups.

1. Prompting a General-Purpose Language Model

Our original approach was to use a general model like GPT-4 “as-is” by writing carefully crafted prompts. It’s fast, flexible, and useful for drafting emails and summarising legal text in plain language. However, it’s also the least reliable. The model has no built-in awareness of what’s legally accurate or up to date, and it can hallucinate. For high-stakes policy or compliance work, prompting alone isn’t enough, as we saw in the previous article.
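
To make this concrete, here is a minimal sketch of the prompting-only setup using the OpenAI Python client. The model name, system message, and question are illustrative, and a real setup would need an API key and far more careful prompt design.

```python
# Minimal sketch of the "prompting only" approach (illustrative; assumes an
# OpenAI API key is available in the OPENAI_API_KEY environment variable).
from openai import OpenAI

client = OpenAI()

prompt = (
    "Summarise the main obligations that the EU Batteries Regulation places "
    "on producers, in plain language, in no more than 200 words."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any capable general-purpose model; name is illustrative
    messages=[
        {"role": "system", "content": "You are an assistant for EU policy experts."},
        {"role": "user", "content": prompt},
    ],
)

print(response.choices[0].message.content)
```

The model will answer fluently, but nothing in this setup guarantees that the answer reflects the current text of the legislation, which is exactly the weakness described above.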

2. Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a more robust and reliable approach. It combines a large language model with a curated database of trusted documents, such as EU legislation, official guidance documents, or templates. When asked a question, the model retrieves relevant content from this database and generates its response based on that material. This grounding makes the answers considerably more accurate, because each one is based on real documents rather than the model’s memory. The system is also flexible: new legislation or guidance can be added to the database without retraining the model itself.
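
As a rough illustration of the pattern, the sketch below pairs a toy keyword-based retriever with a generation call. A real system would use embedding-based search over a proper vector store; the documents, model name, and prompts here are placeholders.

```python
# Toy Retrieval-Augmented Generation loop. A real system would use a vector
# database and embedding-based search; the naive keyword retriever below only
# illustrates the pattern: retrieve trusted text first, then generate from it.
from openai import OpenAI

client = OpenAI()

# Stand-in for a curated database of legislation, guidance, and templates.
documents = {
    "batteries_regulation": "Full text of the EU Batteries Regulation ...",
    "csrd_guidance": "Official guidance on corporate sustainability reporting ...",
}

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the question."""
    words = set(question.lower().split())
    scored = sorted(
        documents.values(),
        key=lambda text: len(words & set(text.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer(question: str) -> str:
    """Build a prompt from the retrieved sources and generate an answer."""
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": "Answer only from the provided sources. "
             "If the sources do not cover the question, say so."},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What obligations apply to battery producers?"))
```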

That said, RAG is not without challenges. It takes time and effort to set up. We also need technical support to build and maintain the underlying retrieval infrastructure. And while RAG reduces hallucinations, the quality of responses still depends on how clearly and precisely the questions are asked.

3. Fine-Tuning a Domain-Specific Model

Another option is to fine-tune a smaller model on our own data, such as policy briefs, internal compliance documents, or sector-specific templates. This can help the assistant adopt a specific tone, structure, or set of preferred phrases, making it useful for automating repetitive internal work.
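
As an illustration of what the training data might look like, the sketch below uses the chat-style JSONL format expected by OpenAI’s fine-tuning API. The file name, examples, and base model are placeholders, and any real dataset would need many carefully curated examples.

```python
# Sketch of preparing and submitting fine-tuning data in the chat-style JSONL
# format used by OpenAI's fine-tuning API. File name, examples, and base model
# are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

# Each training example pairs a typical request with the answer we would want,
# written in our preferred tone and structure.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You draft internal policy guidance."},
            {"role": "user", "content": "Draft a one-paragraph briefing note on supplier due diligence."},
            {"role": "assistant", "content": "Briefing note: Suppliers should be screened for ..."},
        ]
    },
    # ... more examples drawn from past briefs and templates
]

with open("policy_finetune.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

uploaded = client.files.create(file=open("policy_finetune.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-4o-mini-2024-07-18",  # illustrative fine-tunable base model
)
print(job.id)
```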

Nevertheless, fine-tuning has its limitations. The training process is lengthy and costly, and once trained, the model is essentially static: it will not be aware of new laws unless it is retrained. It is therefore a poor fit for the fast-paced policy developments in Brussels.

4. Agent-Based Systems

Agents go a step further by allowing the model to carry out a sequence of actions or steps. For example, an agent might identify a product category, retrieve relevant legislation, extract the obligations, and then draft a compliance checklist — all within one workflow.
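
A minimal sketch of such a workflow in plain Python is shown below. The step functions are hypothetical placeholders for LLM calls or database lookups; the point is the chained sequence of steps passing a shared state along, with each intermediate result available for inspection.

```python
# Sketch of a fixed agent-style workflow in plain Python. Each step function is
# a hypothetical placeholder for an LLM call or a lookup against a legal
# database; the chained sequence is what matters, not the implementations.
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    product_description: str
    product_category: str = ""
    legislation: list[str] = field(default_factory=list)
    obligations: list[str] = field(default_factory=list)
    checklist: str = ""

def identify_category(state: WorkflowState) -> WorkflowState:
    state.product_category = "consumer electronics"  # placeholder classification
    return state

def retrieve_legislation(state: WorkflowState) -> WorkflowState:
    state.legislation = ["RoHS Directive", "General Product Safety Regulation"]  # placeholder lookup
    return state

def extract_obligations(state: WorkflowState) -> WorkflowState:
    state.obligations = [f"Check conformity requirements under {act}" for act in state.legislation]
    return state

def draft_checklist(state: WorkflowState) -> WorkflowState:
    state.checklist = "\n".join(f"- {item}" for item in state.obligations)
    return state

steps = [identify_category, retrieve_legislation, extract_obligations, draft_checklist]

state = WorkflowState(product_description="Bluetooth speaker sold in the EU")
for step in steps:
    state = step(state)  # each intermediate state can be logged and inspected

print(state.checklist)
```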

This sounds promising, especially for more complex or repetitive workflows. However, agent-based systems are harder to control and often less predictable, and their behaviour is less transparent. Agents may be useful for automating well-defined internal processes, but they currently lack the reliability required for legal interpretation.

5. Emerging Multi-Step Planning Tools

Finally, there are several emerging tools, such as LangGraph, DeepResearch, or Anthropic’s Constitutional AI, that are designed to guide a model’s reasoning through multiple steps without granting it complete autonomy. With these tools, it becomes easier to inspect intermediate steps, evaluate the logic, and reduce errors in long-form outputs. They are still new, however, and not yet widely accessible.

Conclusion

Each of these approaches offers a different path toward building an AI assistant for policy work. Prompting is fast but fragile. Fine-tuning offers customisation but lacks flexibility. Agent systems are ambitious but often difficult to control. Emerging multi-step tools are promising, but not yet mature.

For our use case — supporting regulatory compliance, policy interpretation, and communication — we’ve chosen to build on Retrieval-Augmented Generation (RAG). It allows us to combine the language capabilities of GPT-4 with the accuracy and traceability of real legal documents.

3. Evaluation and Testing

Once we have developed an initial version of the system, the next step is to assess whether the output is accurate.

To ensure the assistant functions as intended, we combine automated checks with human oversight. We compare its answers to expert-written reference outputs to establish a performance baseline, and use other language models to evaluate how closely its responses match the expected results. In some cases, we run side-by-side comparisons and ask another model to choose the better answer, while recognising that this still reflects machine reasoning. Human experts then review outputs for accuracy, clarity, relevance, and potential risks. Finally, we test the assistant in real policy and compliance workflows to see if it saves time, reduces rework, and genuinely supports better decisions.
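
As a rough illustration of the automated part of this process, the sketch below compares the assistant’s answers against expert-written reference answers using a second model as a judge. The test cases, judge prompt, and scoring scale are illustrative, and human review would still follow.

```python
# Sketch of automated evaluation against expert-written reference answers,
# using a second model as a judge. Reference data, judge prompt, and scoring
# scale are illustrative; low-scoring answers are flagged for human review.
import json
from openai import OpenAI

client = OpenAI()

test_cases = [
    {
        "question": "Which reporting obligations apply to a mid-sized EU battery importer?",
        "reference": "Expert-written reference answer goes here ...",
        "assistant_answer": "Answer produced by the assistant goes here ...",
    },
]

JUDGE_PROMPT = (
    "Compare the candidate answer with the reference answer. "
    "Return JSON with keys 'score' (1-5, where 5 means fully consistent with "
    "the reference) and 'issues' (a list of factual or legal discrepancies)."
)

for case in test_cases:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": json.dumps(case)},
        ],
    )
    verdict = json.loads(response.choices[0].message.content)
    flag = "needs human review" if verdict["score"] < 4 else "ok"
    print(case["question"], verdict["score"], flag)
```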

What Comes Next

In the weeks ahead, we’ll begin training and testing a retrieval-based system designed to assist with real EU policy and compliance tasks. Our goal is not to replace human experts, but to explore where AI can offer practical support — and where it still falls short.

We’ll share updates on what we learn, including:

  • What kinds of tasks the model can handle reliably

  • Where human supervision remains essential

  • How to balance automation, traceability, and professional responsibility

If we get this right, the result won’t be a replacement for policy expertise — but a valuable assistant that allows us to focus on the task at hand.