SAR Begone! Automating Redactions with AI

A true (but redacted) story about legal compliance, robot lawyers, and the surprisingly emotional experience of watching software do your filing.

There is a certain type of task that exists in every organisation. You know the one. It sits in a corner of the To Do list, technically urgent, legally significant, and deeply, profoundly tedious. Nobody wants to do it, everyone agrees it must be done, and the person who ends up doing it spends the next three weeks quietly resenting everything.

We are talking, of course, about Subject Access Requests.

For the uninitiated: a Subject Access Request (SAR) is a legal right that lets any individual ask an organisation to hand over all the personal data it holds on them. It sounds simple. It is not simple. It means trawling through years of emails, pulling out anything that mentions the person in question, carefully redacting anyone else's personal information from those same documents, and then packaging the whole lot up in a way that is both legally compliant and comprehensible to a human being.

It is, in short, the kind of job that makes a very strong case for tea.

The Old Way

Until recently, the standard approach to SAR redaction involved Adobe Acrobat, a very expensive subscription, and a member of the HR team who developed an increasingly complex relationship with the Ctrl+F shortcut.

The process went something like this. Export the emails. Open each one. Search for the relevant name. Find it. Redact it. Search for the next thing. Redact that. Check you haven't accidentally redacted a heading. Check you haven't missed a footer. Open the next email. Repeat. Repeat again. Keep repeating until either the task is complete or your wrists file their own formal complaint.

This is not a criticism of the people doing it. It is a criticism of the situation. Manual redaction of large email archives is genuinely, objectively, one of the most repetitive tasks a knowledge worker can be asked to perform. It is also one where the cost of a mistake is disproportionately high. Miss a phone number, leave in a colleague's home address, accidentally include someone's medical information, and suddenly you have a data breach on top of your SAR. The pressure is significant. The margin for human error, after hour four of staring at PDFs, is also significant.

And it costs a great deal of money. Adobe's enterprise licensing is not cheap. Neither is the staff time, billed at professional rates, to sit and do something a computer could, in principle, do faster, more consistently, and without ever needing an occupational health assessment.

The Problem With Emails

Emails are wonderful and emails are terrible. They are how most of us conduct the vast majority of our working lives, and they therefore contain, in no particular order: important decisions, meeting scheduling that could have been a text message, a great deal of forwarded chain correspondence, the occasional accidental Reply All, and a frankly alarming quantity of personal information about people who have nothing to do with the subject of your SAR.

This is the central challenge. You cannot hand over an email that mentions your subject if it also contains someone else's home address, their salary, or their doctor's appointment. You have to redact it. You have to find it first. And you have to do all of this across, say, a few thousand emails.

A few thousand emails.

By hand.

This is where sensible people turn to automation.

Enter Claude, and the Concept of a Project

The new approach centres on Claude, an AI assistant made by Anthropic. But before we get into what Claude did, it is worth explaining how it was used, because the how matters as much as the what.

Claude was not simply asked questions in a chat window. It was set up as a working project: a persistent session with memory, context, and a shared workspace folder sitting on an actual computer. Think of it less like asking a search engine and more like bringing a very capable, very patient colleague up to speed on a specific piece of work, and then leaving them to get on with it while you make a coffee.

The shared workspace is the key bit. Claude could read files, write files, run code, check its own output, and save everything to a folder that the human could open and review at any point. There was no copy-pasting back and forth. No "here is the result, please now put it somewhere useful." The output just appeared, labelled, organised, in the right place. This is a genuinely different way of working with AI, and it takes a moment to adjust to, in the same way it takes a moment to accept that a slow cooker set going at eight in the morning will, without further intervention from you, have produced an actual meal by six in the evening.

Enter the Pipeline

What I built, over a rather intense series of sessions, was a full processing pipeline. It starts with a PST file (the format Microsoft Outlook uses to store emails, named, presumably, by someone who had given up on making things intuitive) and ends with a folder full of redacted PDFs, a detailed processing report, and enough CSV index files to keep a spreadsheet enthusiast happy for weeks.

The pipeline reads every single email. It extracts the text. It looks for names, phone numbers, email addresses and postcodes. It cross-references a known list of people who are not the subject of the request, and it blacks out their details wherever they appear. It handles email bodies, attachments, forwarded threads, and the peculiar formatting quirks that emerge when someone has copy-pasted something from a Word document into Outlook in 2019 and the line breaks have never quite recovered.
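For the curious, the pattern-matching step might look something like the sketch below. It is a minimal illustration in Python, not the pipeline's actual code: the regular expressions, the `redact` function and the `[REDACTED]` marker are all assumptions made for the example, and the phone and postcode patterns assume UK formats. A real pipeline would also draw black boxes on the PDF rather than swap text.

```python
import re

# Illustrative patterns only (assumed, UK-flavoured); real archives need
# broader patterns and a human check on whatever they miss.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
UK_PHONE_RE = re.compile(r"(?<!\w)(?:\+44\s?|0)\d(?:\s?\d){8,9}(?!\w)")
UK_POSTCODE_RE = re.compile(
    r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE
)


def redact(text, third_party_names, mask="[REDACTED]"):
    """Mask contact details, then any name on the third-party list."""
    # Pattern-based details first: email addresses, phones, postcodes.
    for pattern in (EMAIL_RE, UK_PHONE_RE, UK_POSTCODE_RE):
        text = pattern.sub(mask, text)
    # Longest names first, so "Jane Doe-Smith" is handled before "Jane Doe".
    for name in sorted(third_party_names, key=len, reverse=True):
        text = re.sub(
            rf"\b{re.escape(name)}\b", mask, text, flags=re.IGNORECASE
        )
    return text


sample = (
    "Sam Smith met Jane Doe. Call 07700 900123 or "
    "jane.doe@example.com, SW1A 1AA."
)
print(redact(sample, ["Jane Doe"]))
```

The subject's own name is deliberately absent from the list, so it survives; only the people on the third-party list are masked.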

It also learned to be appropriately cautious about a two-letter combination that, in the right context, was indeed a person's initials, but in the wrong context would have blacked out part of an entirely unrelated word — or, in one memorable case, a chunk of someone's grandmother's shortbread recipe that had, for reasons nobody could adequately explain, made its way through corporate email.
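One plausible way to get that caution is to treat the initials as a match only when they stand alone as an upper-case token. The sketch below assumes exactly that rule; the function name `redact_initials` and the example initials are hypothetical, and the real pipeline may well have used a different test for context.

```python
import re


def redact_initials(text, initials, mask="[REDACTED]"):
    """Mask initials only where they appear as a standalone upper-case token."""
    # The lookarounds reject any match that touches a letter, digit or
    # underscore on either side, so "AL" inside "ALMONDS" is left alone.
    # Matching is case-sensitive, so "Also" and "almond" are untouched too.
    pattern = re.compile(rf"(?<!\w){re.escape(initials.upper())}(?!\w)")
    return pattern.sub(mask, text)


print(redact_initials("AL approved it. Also add ALMONDS and almond flour.", "AL"))
```

A rule like this still catches a genuine two-letter abbreviation written in capitals, which is one reason ambiguous hits are better flagged for human review than redacted blindly.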

A Report Worth Reading

The processing report itself became something of a project in its own right. Twelve sections. Tables of statistics. A detailed breakdown of every attachment type encountered, every category of redaction applied, every file that needed human review and why. The kind of document that says, clearly and in full sentences, "we did this thoroughly, we can account for every decision, and here is the evidence."

It also went through several rounds of correction. Turns out that when you have spent weeks staring at thousands of files, you occasionally remember a number wrong. Four CSV files. No, five. Three thousand and something attachments. No, more. The final version is right. The earlier versions are a good argument for always going back to look at the actual data rather than trusting your own memory of it.

Packing It Up to Go: READMEs, Skills, and What on Earth Is an MD File

Once the pipeline was working, the natural question arose: what if someone else needs to do this again? What if you need to do this again, in six months, having completely forgotten how any of it works?

This is a real problem with bespoke automation. You build something clever, it solves the problem, and then it sits in a folder on a server slowly becoming archaeology. Nobody touches it because nobody remembers how it works, and nobody wants to be the person who breaks it.

The answer, in this case, was documentation. Proper documentation. The kind that explains not just what the scripts do but why they make the decisions they make, in what order to run them, what can go wrong, and how to adapt them for a different subject.

This documentation lives in files with the extension .md, which stands for Markdown. If you have never encountered Markdown, it is essentially a very simple way of writing formatted text using only plain characters: a hash symbol makes a heading, asterisks make something italic or bold, a hyphen makes a bullet point. The file looks slightly odd if you open it in Notepad, and looks perfectly readable if you open it in almost anything else. GitHub likes it. Claude likes it. It has the considerable advantage of being future-proof, because plain text does not get corrupted, does not require a specific application to open, and will still be readable in twenty years when whatever word processor you are currently using has gone the way of WordPerfect.

We wrote a README (the traditional "please read this before you touch anything" document that software projects leave at their front door), a full pipeline guide, and a technical reference covering every function in every script. And then we went one step further and turned the whole thing into an installable skill.

A skill, in this context, is Claude's equivalent of a trained specialisation. It is a small package containing the key knowledge and instructions Claude needs to understand a specific topic immediately, rather than having it explained from scratch each time. Install the skill, and Claude already knows what a PST file is in this context, what the output folders should look like, and which script to reach for when something has gone sideways. It is the difference between handing someone a filing cabinet and saying "good luck" versus sitting with them for an hour and walking them through the system.

The future self who picks this up in six months will be grateful. The future self always is, when past self has done the filing properly.

One More Thing

Right at the end, we pulled out all the email attachments, raw and unredacted, and sorted them into folders by sender. Six thousand, six hundred and ninety-six files. Images, spreadsheets, PDFs, old Word documents in formats that date back to an era when "save as" had fewer options. All organised, all labelled with the email sequence number and the date, all sitting neatly in their respective folders.
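The sorting itself is not complicated, and a sketch of it fits on one screen. Everything named below (the `Attachment` record, `safe_folder_name`, `target_path`, the five-digit sequence prefix) is a hypothetical stand-in for illustration, assuming each attachment arrives with its email's sequence number, date and sender already known.

```python
import re
import shutil
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class Attachment:
    seq: int       # sequence number of the parent email
    sent: date     # date the parent email was sent
    sender: str    # sender as it appears in the email header
    source: Path   # where the extracted file currently lives


def safe_folder_name(sender):
    """Turn a sender string into something safe to use as a folder name."""
    cleaned = re.sub(r"[^A-Za-z0-9@._-]+", "_", sender.strip().lower())
    return cleaned.strip("_") or "unknown_sender"


def target_path(att, out_root):
    """Build <out_root>/<sender>/<seq>_<date>_<original filename>."""
    name = f"{att.seq:05d}_{att.sent:%Y-%m-%d}_{att.source.name}"
    return out_root / safe_folder_name(att.sender) / name


def sort_attachments(attachments, out_root):
    """Copy each attachment into its sender's folder under its new label."""
    for att in attachments:
        dest = target_path(att, out_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(att.source, dest)
```

Two same-named attachments on one email would collide under this naming scheme, so a real version would want a counter or a short hash on the end.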

How Long Would This Have Taken the Old Way?

This is the question people always want answered, and the honest answer is: longer than you have.

A SAR must be completed within one calendar month. That is the legal requirement, and it does not move. Two thousand, eight hundred and fifty emails at a conservative ten minutes each to open, read, search, redact, double-check and close is 475 hours of skilled staff time. The better part of twelve weeks, full-time, for one person. Fitted around everything else HR has to do, in calendar terms it would blow past the deadline without breaking a sweat, and bring a formal compliance failure with it.

Then there are the attachments. Six thousand, six hundred and ninety-six files to locate, review, and sort by hand. At even two minutes each, that is roughly another 223 hours. Add the time spent writing the processing report, chasing down inconsistencies in the numbers, and organising the final output into something a solicitor can work with, and you are comfortably north of 700 hours of professional time.
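For anyone who would rather check the sums than trust them, here is the arithmetic as a few lines of Python. The email and attachment counts and the per-item minutes come from the figures above; the 40-hour working week is an assumption added for the conversion to weeks.

```python
EMAILS = 2850
ATTACHMENTS = 6696
MINUTES_PER_EMAIL = 10        # the "conservative" estimate above
MINUTES_PER_ATTACHMENT = 2
HOURS_PER_WEEK = 40           # assumed full-time week


def hours(items, minutes_each):
    """Convert a count of items at a per-item time into hours."""
    return items * minutes_each / 60


email_hours = hours(EMAILS, MINUTES_PER_EMAIL)
attachment_hours = hours(ATTACHMENTS, MINUTES_PER_ATTACHMENT)
total_hours = email_hours + attachment_hours

print(f"Emails:      {email_hours:.1f} hours")
print(f"Attachments: {attachment_hours:.1f} hours")
print(f"Combined:    {total_hours:.1f} hours")
print(f"Emails alone: {email_hours / HOURS_PER_WEEK:.1f} working weeks")
```

Run as written, it gives 475 hours for the emails and a little over 223 for the attachments, which puts the two together just shy of 700 before any report writing is counted.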

The pipeline processed the same archive in a matter of hours, across several sessions of setup, checking, and refinement. The whole thing came in at roughly the cost of one month's Claude subscription, my wage and a whole box of Yorkshire Tea.

Seven hundred hours of skilled staff time, a legal deadline that does not negotiate, and a compliance risk that very much does have a price: all of it handled for approximately what you would spend on a single Adobe Acrobat licence renewal. The maths is not subtle.

What We Actually Learned

Automation does not remove the need for judgement. It removes the need to apply that judgement to the same problem six thousand times. You still have to decide what a redaction means. You still have to check the edge cases. You still have to read the output with human eyes and ask whether it makes sense.

The Adobe subscription hopefully gets cancelled at some point. Instead of spending that money on repetitive manual work, you spend a fraction of it once, building something that does the repetitive part for you, and then you spend the rest on people doing the things that actually require a person.

That feels like progress. And it was considerably kinder on everyone's wrists.

Total emails processed: 2,850. Total attachments extracted: 6,696. Cups of tea consumed during the project: untracked, but the estimate is substantial.