Summary: Genetic, medical, and lifestyle data from all 500,000 UK Biobank volunteers was listed for sale on Alibaba after three Chinese research institutions with legitimate access violated their data-sharing agreements. The data was de-identified but includes genome sequences, hospital diagnoses, and biological measures that experts say can be re-identified. Alibaba removed the listings before any sales were made, UK Biobank has paused all external data access, and the ICO is investigating. A March investigation had already found the data leaked dozens of times via GitHub.
The genetic, medical, and lifestyle data of 500,000 British volunteers was listed for sale on Alibaba’s e-commerce platform in China this week, the UK government confirmed on Wednesday, in a breach that did not require a single line of malicious code. Three research institutions in China that had been granted legitimate access to UK Biobank’s database downloaded the data, then listed it for sale. It was not a hack. It was a contract violation by trusted researchers, and that distinction makes it worse, not better, because it exposes a vulnerability that no firewall can fix: the entire model of open research data sharing assumes that everyone who receives the data will follow the rules.
Ian Murray, the Minister of State, told the House of Commons that UK Biobank informed the government on Monday 20 April that three listings had been identified on Alibaba, with at least one appearing to contain data from all 500,000 participants. The data was de-identified, meaning it did not include names, addresses, contact details, or NHS numbers. It did include gender, age, month and year of birth, socio-economic status, lifestyle habits, and measures from biological samples. With support from both the UK and Chinese governments, Alibaba removed the listings before any sales were made. The three institutions had their access revoked. UK Biobank has paused all external data access while it develops a technical solution to prevent bulk downloads, and has referred itself to the Information Commissioner’s Office.
What UK Biobank holds
UK Biobank is one of the most valuable biomedical research resources in the world. Between 2006 and 2010, it recruited 500,000 volunteers aged 40 to 69 across Great Britain, who consented to share their health data and be followed for at least 30 years. The database now holds more than 10,000 variables per participant, including whole genome sequences for all 500,000 volunteers (released in full in 2023), blood and urine biomarkers, brain and body imaging scans, hospital diagnosis records, GP data, and detailed lifestyle questionnaires. Approximately 22,000 researchers worldwide have access to the data for approved studies into cancer, heart disease, diabetes, Alzheimer’s, and other conditions. The resource has generated thousands of peer-reviewed papers and is considered foundational to modern genomic medicine.
The data is shared on the basis that it is de-identified. Researchers sign material transfer agreements prohibiting redistribution. The model depends on compliance with those agreements. What happened this week is that three institutions broke their agreements, and the only reason anyone knows is that they were brazen enough to list the data for sale on a public marketplace.
The re-identification problem
The government’s assurance that the data did not contain names or addresses is accurate but incomplete. A Guardian investigation published in March found that de-identified UK Biobank data had been exposed online dozens of times, with researchers inadvertently posting partial or complete datasets to GitHub, the code-sharing platform. Between July and December 2025, UK Biobank issued 80 legal notices to GitHub requesting removal. In one case, a dataset containing millions of hospital diagnoses and associated dates for more than 400,000 participants was published openly.
The Guardian demonstrated that the data is not as anonymous as it appears. A reporter was able to pinpoint a volunteer’s extensive hospital diagnosis records using only their month and year of birth and the details of a major surgery they had undergone, information that many people share in everyday conversation. Dr Luc Rocher, associate professor at the Oxford Internet Institute, told the paper that removing identifiers “often does not guarantee anonymity” and that knowing a person’s birthday and a specific medical event date might be sufficient to identify their record with high confidence. Once identified, that record could reveal psychiatric diagnoses, HIV test results, or histories of substance abuse.
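The matching described above can be illustrated with a small sketch. It is not the Guardian's method or UK Biobank's schema: the records, field names, and values below are entirely synthetic and chosen only to show how a month and year of birth that is shared by several people becomes a unique fingerprint once a procedure and its date are added.

```python
from collections import Counter

# Entirely synthetic records. Field names and values are illustrative
# assumptions, not UK Biobank's actual schema or data.
records = [
    {"id": "A", "birth": "1955-03", "procedure": "hip replacement", "procedure_date": "2014-06"},
    {"id": "B", "birth": "1955-03", "procedure": "hip replacement", "procedure_date": "2016-11"},
    {"id": "C", "birth": "1955-03", "procedure": "appendectomy", "procedure_date": "2014-06"},
    {"id": "D", "birth": "1961-09", "procedure": "hip replacement", "procedure_date": "2014-06"},
    {"id": "E", "birth": "1961-09", "procedure": "hip replacement", "procedure_date": "2014-06"},
]


def group_sizes(rows, keys):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_ids(rows, keys):
    """Return ids of rows whose combination of fields matches no other row."""
    sizes = group_sizes(rows, keys)
    return [row["id"] for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


# Month and year of birth alone singles out nobody in this toy dataset.
print(unique_ids(records, ["birth"]))  # []

# Adding the procedure and its date isolates three of the five records.
print(unique_ids(records, ["birth", "procedure", "procedure_date"]))  # ['A', 'B', 'C']
```

Records D and E stay hidden only because they happen to be identical on every field checked. In a real dataset with thousands of variables per person, that kind of coincidence becomes rarer with each additional detail an outsider knows.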
Under UK GDPR, data is only truly anonymised if individuals cannot be identified “by any reasonably likely means.” With datasets of this size and richness, especially those containing full genome sequences, the question is not whether re-identification is theoretically possible but whether it is practically difficult enough to constitute meaningful protection. The governance gap in data security is widening as datasets grow larger and AI tools make cross-referencing easier. Privacy experts argue that UK Biobank’s approach, treating de-identification as a sufficient safeguard, is at odds with the reality that many people share fragments of their health information online, and in the age of large language models, those fragments can be reassembled.
A pattern, not an incident
The Alibaba listings are the most dramatic manifestation of a structural problem that UK Biobank has been managing, with limited success, for months. The March investigation revealed that data leaks had occurred dozens of times, driven by the tension between two competing imperatives: open science and data confidentiality. Journals and funders increasingly require researchers to publish the code they use to analyse large datasets, and that code sometimes includes the data itself, or enough of it to be reconstructed. UK Biobank prohibits this, but enforcement has depended on discovering violations after the fact and issuing takedown notices.
The breach also fits a broader pattern of institutional data exposure across Europe, which IBM identified as the world’s most targeted region for cyberattacks, with the UK accounting for 27% of all attacks on the continent. The Synnovis ransomware attack in June 2024 disrupted pathology services across southeast London for weeks after the Qilin group published patient data from Guy’s and St Thomas’ and King’s College Hospital trusts on the dark web. The Advanced Software ransomware attack in August 2022 took down NHS 111 services. WannaCry in 2017 hit 80 NHS organisations. Each of those was a traditional cyberattack, an external adversary exploiting a technical vulnerability. The Biobank breach is different. The adversary was inside the system, credentialled and approved, and the vulnerability was the access model itself.
The geopolitical dimension
That the data appeared on a Chinese platform will inevitably sharpen the political response. The UK has spent the past five years progressively restricting Chinese technology involvement in critical infrastructure, from the Huawei 5G ban to the National Security and Investment Act’s powers over sensitive data acquisitions. In March 2024, the government accused China-linked actors of cyberattacks on the Electoral Commission and parliamentarians. Chinese state-sponsored hackers have targeted Western governments repeatedly, including a campaign the Dutch government publicly attributed to Beijing that compromised more than 20,000 systems.
Murray thanked the Chinese government “for the speed and seriousness with which they worked to help remove these listings,” a diplomatic formulation that acknowledged cooperation while sidestepping the question of how three Chinese research institutions came to violate their data-sharing agreements simultaneously. The minister did not name the institutions. The ICO said it is “making enquiries.” Whether this was opportunistic misconduct by individual researchers or something more coordinated is a question the investigation will need to answer.
What happens next
UK Biobank has temporarily suspended all access to its research platform and is developing an automated checking system to prevent de-identified participant data from being extracted in bulk, with a target of having the system operational by the end of 2026. The organisation is also implementing strict limits on the size of files that can be taken off the platform. Conor O’Neill, chief executive of cybersecurity firm OnSecurity, said the breach “is a reminder that data protection failures are rarely the result of malicious intent” and pointed to “a cultural gap between policy and practice” in how researchers handle sensitive data.
The vulnerability of public institutions to data theft is not new. But the Biobank case is distinctive because the data was not stolen in any conventional sense. It was given away, under contract, to researchers who broke the contract. The 500,000 volunteers who signed up between 2006 and 2010 consented to share their most intimate biological information for the advancement of medical science. They did not consent to have it listed for sale on a Chinese e-commerce site. The distinction between a hack and a breach of trust may be legally significant. For the people whose genomes are in that database, it is not.


