UK Biobank health data keeps ending up on GitHub

That's the least of it: https://www.bbc.co.uk/news/articles/cpvxgl3n138o

All 500,000 participants for sale on Alibaba...

And official response: https://www.ukbiobank.ac.uk/news/a-message-to-our-participan...

> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.

To me it seems rather naive to have done that.

After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.

If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.

Too late to do anything about it now though :(

The irony is, they don’t even provide the data to the participants themselves.

Took me 5 minutes to find more: https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a65... (Uses Date of Birth column).

And some information on how they were distributing it to researchers: https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...

> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.

> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.

https://biobank.ctsu.ox.ac.uk/crystal/download.cgi

What are the pros/cons of just open-sourcing everything for future bio bank projects?

> Took me 5 minutes to find more: https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a65... (Uses Date of Birth column).

Good catch! The data is everywhere, re-uploaded every week.

I am aware of ~30 repositories that UK Biobank has asked GitHub to delete and that can still be found elsewhere online. They know the site, they have managed to delete data from that site before, and yet the files are still there.

One of the favorite lessons I learned is that anything at scale has to be designed for idiots. I am pretty sure every person reading this has had days where they have done absolutely stupid things without realizing. Now assume there are thousands of users, and you could be providing tools to the smartest people in the world and still have people do stupid stuff all the time. This doesn't just apply to UX.

Then there's the question of trust. You probably have friends you know not to tell certain secrets to, because they believe they get to delegate your secrets onwards to people they trust. The further away someone is from you, the less respect they will show. Researchers have been loaning the dataset in good faith to people who they trust, but who probably didn't take the whole secrecy thing as seriously.

With 20k researchers this was inevitable. Factors like these need to be taken into account when deciding on what grounds such a dataset is released.

Not giving the data to researchers means not getting the scientific benefits from that data. Which was the point of collecting that data in the first place.

Reckless harm prevention is the root of many evils.

That’s insane. And what does "researcher" even mean? Some random university student? What would they know about securing that data? I wonder if the people whose data is out there even know this is happening.

> The irony is, they don’t even provide the data to the participants themselves.

Huh? I got my report over email. I think you have to ask for it.

> What are the pros/cons of just open-sourcing everything for future bio bank projects?

It's exceptionally difficult to avoid the data being de-anonymised.

If an 'anonymised' medical record says the person was born 6th September 1969, received treatment for a broken arm on 1 April 2004, and received a course of treatment in 2009 after catching the clap on holiday in Thailand - that's enough bits of information to uniquely identify me.

And medical researchers are usually very big on 'fully informed consent' so they can't gloss over that reality, hide it in fine print or obfuscate it with flowery language. They usually have to make sure the participants really understand what they're agreeing to.

It might still work out fine, of course - 95% of people's medical histories don't contain anything particularly embarrassing, so you might be able to get plenty of participants anyway.
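
To put a number on "enough bits of information", here is a minimal Python sketch that measures what fraction of records are unique on a chosen set of quasi-identifiers. The rows and column names are made up for illustration and say nothing about the actual UK Biobank schema.

```python
from collections import Counter

# Hypothetical toy records, not real data.
rows = [
    {"birth_year": 1969, "birth_date": "1969-09-06", "arm_fracture": "2004-04-01"},
    {"birth_year": 1969, "birth_date": "1969-03-14", "arm_fracture": "2004-04-01"},
    {"birth_year": 1969, "birth_date": "1969-09-06", "arm_fracture": "2011-07-19"},
    {"birth_year": 1970, "birth_date": "1970-01-02", "arm_fracture": "2004-04-01"},
]


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


coarse = unique_fraction(rows, ["birth_year"])
fine = unique_fraction(rows, ["birth_date", "arm_fracture"])
print(f"unique on birth year alone: {coarse:.0%}")
print(f"unique on birth date + fracture date: {fine:.0%}")
```

In this toy table only one of four people is unique on birth year, but everyone is unique once a single treatment date is added.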

The people who agreed to contribute their biodata did not consent to that.

If you want such a project you need to have a new project with a different agreement. I doubt you could get as many volunteers to freely give away such intimate data to anyone who wants it, though.

You mean giving anyone access to the data? Or open sourcing the code? If the latter, I think that's generally good practice. Security through obscurity is never good for public infrastructure. In this case, UK Biobank has now switched to a remote access platform (not particularly secure, as the data was found for sale on Alibaba today), but contracted it out to DNAnexus and Amazon. Private companies have no incentive to open source data, unless mandated to do so.

In the EU, there is a bigger interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.

One of the most important "cons" is that without controls, fewer people will allow their data to be included in the data sets.

'Anonymisation' schemes are a little like encryption, in that they just get monotonically weaker over time as people work out attacks. But the attacks tend to be much worse. I work in academic open data publishing, and the Netflix Prize de-anonymisation (https://arxiv.org/abs/cs/0610105) hangs over our heads.

But what this illustrates to me is that researchers are just really careless, despite everything we make them agree to in data transfer agreements. It seems absurd to have little cubicles like this https://safepodnetwork.ac.uk/ (think Mission Impossible 1) but I do despair.

They need to sell the data to fund the project

Hard to do. The same people with the collection and tracking infrastructure required are infinitely sue-able so you need legal protection if anything goes wrong.

> ... received a course of treatment in 2009 after catching the clap on holiday in Thailand

Yeah, sorry about that

In my experience with health data, the dates are usually offset by a random but constant amount for each person (e.g. id 12345 will have all their dates shifted by +5 weeks) to avoid identification by dates.

Unfortunately the sequence of treatments and locations are usually enough to identify someone, especially if it's a rarer condition.
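
A rough sketch of that kind of per-person date shifting, assuming the offset is derived from a keyed hash of the participant id (an assumption for illustration; a real pipeline might instead draw and store random offsets). The function names and the 26-week bound are hypothetical.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_WEEKS = 26  # hypothetical bound on the shift, in weeks


def offset_days(person_id: str, secret: bytes) -> int:
    """Constant per-person offset: a whole number of weeks in [-MAX_WEEKS, MAX_WEEKS]."""
    digest = hmac.new(secret, person_id.encode("utf-8"), hashlib.sha256).digest()
    weeks = int.from_bytes(digest[:4], "big") % (2 * MAX_WEEKS + 1) - MAX_WEEKS
    return 7 * weeks


def shift_date(person_id: str, original: date, secret: bytes) -> date:
    """Apply the same offset to every date belonging to one person."""
    return original + timedelta(days=offset_days(person_id, secret))


secret = b"example-key-kept-away-from-researchers"
fracture = date(2004, 4, 1)
follow_up = date(2004, 5, 13)
shifted_fracture = shift_date("12345", fracture, secret)
shifted_follow_up = shift_date("12345", follow_up, secret)

# The gap between one person's events is unchanged by the shift.
print(shifted_follow_up - shifted_fracture == follow_up - fracture)
```

Because the offset is constant per person, the intervals between visits survive intact, which is why the sequence of treatments can still single someone out.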

> Not giving the data to researchers means not getting the scientific benefits from that data. Which was the point of collecting that data in the first place.

> Reckless harm prevention is the root of many evils.

As a biostatistician who's touched epidemiological studies, I'd argue losing the trust of participants and the public is one of the biggest threats to the viability of the whole research enterprise. It's reckless to jeopardize that as well. Conversely, this dataset will be mined for at least 30-50 years - there are an infinite number of questions that can be asked of this data. Given that timescale, I think a little delay here is acceptable.

> Unfortunately the sequence of treatments and locations are usually enough to identify someone, especially if it's a rarer condition.

Location data is very readily available, so you can easily correlate visits to a health facility with a treatment, and even with an offset, you can probably uniquely identify someone with 4 visits depending on the size of the medical facility.

> One of the most important "cons" is that without controls, fewer people will allow their data to be included in the data sets.

That's a very important point. The people who opt out first are typically not a random fraction of the population, and this makes it much harder to make any analyses with the resulting datasets: it gets very hard to know if your analyses are representative of the population, or not.

The people involved are volunteers. The rules for getting access are readily available, and clearly don't include "some random university student": https://www.ukbiobank.ac.uk/about-us/how-we-work/access-to-u...

> Hard to do. The same people with the collection and tracking infrastructure required are infinitely sue-able so you need legal protection if anything goes wrong.

Really don't think this is any issue given the post we are commenting on...

I had access to several health datasets for my research in the past. Date of birth was rarely given, especially for the bigger projects where there were more resources to allocate to privacy protection. Neither was date of death, location, or visits to a health facility with a treatment. Typically the relevant variables are age (in years), treatment type and possibly number of cycles. Probably insufficient to identify someone without access to hospital records. But if you have that, you have all these data anyways.

Most researchers likely would want to summarize these data in a similar way anyway, so this works out nicely.

This is why it was such a big deal when that researcher at Cleveland State misappropriated UKBB data for a race-science study with Emil Kirkegaard. After he was fired, people on Twitter were all like "this is just suppression of science", but the reality is that what they did, contravening UKBB rules, constituted potentially an existential threat to the whole program.

They clearly do include "some random student" as the data can be shared with others from the eligible research group which are almost always university students who have zero clue about itsec.

> They clearly do include "some random student" as the data can be shared with others from the eligible research group which are almost always university students who have zero clue about itsec.

I worked in this field. It's not just the students. Hardly anyone seemed to understand how and why you would keep data out of a git repo.

UK Biobank holds genetic, health, and lifestyle data on half a million British volunteers. It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further. And yet, researchers are repeatedly uploading participant data by mistake to public GitHub repositories.

According to The Guardian, UK Biobank has been closely monitoring the situation, contacting researchers directly and then issuing takedown notices when repositories are not deleted, including repositories uploaded by researchers and students to whom Biobank never gave data in the first place.

This tracker monitors the 110 notices filed so far, targeting 197 code repositories by 170 developers across the world, using public data from GitHub's DMCA archive.

From only two pieces of information (approximate date of birth and date of a single major surgery), the Guardian was able to re-identify a volunteer in one of the exposed datasets. For BMJ, Jess Morley and I argue that UK Biobank is harming participants by dismissing re-identification risks but advising them to now limit what they share online. Institutions like Biobank must demonstrate humility, a commitment to listening to privacy experts, and a willingness to learn.

Built by Luc Rocher, Oxford Internet Institute, University of Oxford

What is UK Biobank trying to take down

UK Biobank uses copyright takedown notices, a mechanism often associated with removing pirated software and stolen code, to remove health data from GitHub. The UK has no equivalent of DMCA for privacy breaches that would compel a platform to act so quickly.

Looking at the takedown notices, we often see specific files being targeted rather than entire repositories—possibly to justify the copyright infringement as required for a takedown notice. Nearly half are Jupyter or R notebooks, which can contain a few rows of data. A quarter are genetic and genomic data files (PLINK, BOLT-LMM, BGEN) that directly encode participant genotypes or association results. Tabular datasets (CSV, TSV, Excel, and serialised R objects) account for another large share and could contain phenotype or health records. The remainder includes analysis scripts, documentation, and compressed archives.

Timeline of takedown notices

The first takedown notice was filed in July 2025. Since then, the pace has been steady, with a total of 110 requests to GitHub. Interestingly, the requests stopped in January, February, and most of March 2026. It's hard to believe that no researcher mistakenly uploaded UK Biobank data during these months. The notices resumed at the end of March, just after the Guardian's investigations revealed the ongoing data exposure and the ineffectiveness of takedowns.

Where in the world

Developers targeted by UK Biobank's takedown notices are based in at least 14 countries. The true number is likely higher: of the 170 developers identified in the notices, only 75 list a location on their GitHub profile. Most appear to be from the United States and China.

24 United States
21 China
7 United Kingdom
5 Germany
4 Hong Kong
4 Australia
3 Spain
1 South Korea
1 Greece
1 Qatar
1 United Arab Emirates
1 Switzerland
1 India
1 Netherlands

Methodology

To build this webpage, I used data from the github/dmca repository, where GitHub publishes the full text of every DMCA takedown notice it receives. When a rights holder asks GitHub to remove content that infringes their copyright, the notice is posted publicly as a Markdown file in this repository. According to The Guardian, UK Biobank has used this process to request the removal of files or repositories that contain (or that it believes contain) participant data covered by its data access agreements.

To identify UK Biobank-related notices, I match filenames containing the slug "uk-biobank" (the convention GitHub uses when naming notice files). Just in case, I also search the full text of every other notice file for the phrases "UK Biobank" or "UKBiobank" (case-insensitive) to catch notices filed under different slugs, such as those submitted on behalf of UK Biobank. From each matching notice, I extract the filing date (parsed from the filename, which follows GitHub's YYYY-MM-DD-slug.md convention) and all GitHub repository URLs mentioned in the notice body. URLs pointing to GitHub's own infrastructure (e.g. github.com/contact or github.com/site) are excluded.
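
The matching and extraction steps described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the tracker's actual script: the regular expressions, the `parse_notice` helper, and the sample notice text are all hypothetical.

```python
import re
from datetime import date
from typing import Optional

# Assumed patterns; the real collection script may differ.
FILENAME_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.md$")
MENTION_RE = re.compile(r"UK ?Biobank", re.IGNORECASE)
REPO_URL_RE = re.compile(r"https?://github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)")
GITHUB_OWN_PATHS = {"contact", "site"}  # GitHub's own pages, not user repositories


def parse_notice(filename: str, text: str) -> Optional[dict]:
    """Return the filing date and repositories for a UK Biobank notice, else None."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    year, month, day, slug = match.groups()
    # Match on the filename slug first, then fall back to a full-text search.
    if "uk-biobank" not in slug and not MENTION_RE.search(text):
        return None
    repos = set()
    for owner, repo in REPO_URL_RE.findall(text):
        if owner.lower() in GITHUB_OWN_PATHS:
            continue
        # Drop a trailing full stop picked up from sentence punctuation.
        repos.add(f"{owner}/{repo.rstrip('.')}")
    return {"filed": date(int(year), int(month), int(day)), "repos": sorted(repos)}


sample = (
    "Please remove https://github.com/alice/ukb-analysis/blob/main/data.csv "
    "and https://github.com/bob/gwas. Questions: https://github.com/contact/dmca"
)
notice = parse_notice("2025-07-14-uk-biobank.md", sample)
print(notice)
```

Collecting repositories into a set means a repository listed several times in one notice is counted once.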

For each unique GitHub username found in the notices, I query the GitHub REST API (GET /users/{username}) to retrieve the user's public profile, specifically the self-reported location field. This is a free-text string that users enter voluntarily. It may be a city, a country, a university name, or left blank entirely. Deleted accounts return a 404 and are not included further.
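
As a hedged sketch of this lookup, the snippet below calls the public `GET /users/{username}` endpoint with Python's standard library and reads the `location` field. The `fetch_location` helper, the `opener` parameter (there so the call can be exercised without network access), and the User-Agent string are assumptions for illustration. For brevity it returns `None` both for deleted accounts and for blank locations.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Optional

API_ROOT = "https://api.github.com"


def build_request(username: str) -> urllib.request.Request:
    """Build the request for a user's public profile."""
    return urllib.request.Request(
        f"{API_ROOT}/users/{urllib.parse.quote(username)}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "takedown-tracker-sketch",  # hypothetical client name
        },
    )


def fetch_location(username: str, opener: Callable = urllib.request.urlopen) -> Optional[str]:
    """Return the self-reported location, or None if missing, blank, or account deleted."""
    try:
        with opener(build_request(username)) as response:
            profile = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # deleted or renamed account
            return None
        raise
    location = profile.get("location")
    if not isinstance(location, str):
        return None
    return location.strip() or None
```

Calling `fetch_location("octocat")` would hit the live API; unauthenticated requests are rate-limited, so a real run over many usernames would likely need a token.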

I derive countries from the raw location strings by hand. When a user's GitHub profile does not include a location, I also determine their country by inspecting their GitHub profile and associated email address domains. This process is inherently imperfect: some locations are ambiguous (e.g. "Cambridge" could refer to the UK or the US), and many users do not provide any location at all. Of the 170 unique developers in the dataset, only 75 have a location that could be resolved to a country.

The data is regularly refreshed by re-running the collection script against the latest state of the github/dmca repository. This page does not make any claims about the content of the targeted repositories, including whether they contained actual participant data, derived datasets, analysis code, or just documentation. It reports only what is visible in the public DMCA notices filed by UK Biobank.
