The Francis Crick Institute has hundreds of global partnerships that require controlled access to sensitive health data.
Its current CIO is James Fleming, who joined the research institute four years ago. Fleming, who has a degree in physics, comes from a telecoms background, having previously worked at BT on projects including the 4G backhaul network.
Asked about the similarities and differences between working at the telco and working in a medical research environment, Fleming says: “BT is a data-driven organisation. I spent most of my time working in the IT and applications layer, which are hugely data-driven. What is transferable is the infrastructure needed to do IT well. Disciplines like scalable technology and robust operational processes are directly transferable.”
Since he joined the Institute, Fleming says the IT department has been adapting rapidly and has grown by more than 60%. “We were 54 people when I first joined, but now our full complement is 88,” he says.
IT has responsibility for three tech platforms: the IT infrastructure, high-performance computing and three public clouds. Fleming says his role, and the role of IT, is to make the IT strategy as agile as possible.
Supporting global collaboration
Research conducted at the Francis Crick Institute is based on highly collaborative networks of teams of about 10 people each. There are more than 100 such groups internally. On top of this, the research also involves collaborating with 1,400 institutions across 90 countries, says Fleming.
From an IT perspective, the most immediate difficulty is adhering to various data protection and sovereignty laws, such as the General Data Protection Regulation (GDPR) in Europe and the Health Insurance Portability and Accountability Act (HIPAA) in the US, especially as there is no unified global standard for protecting patients’ data.
There are numerous approaches that IT leaders can take to provide data access, ranging from virtual private networks (VPNs) that control access to data residing on file servers to programmatic control through application programming interfaces (APIs). The required approach depends on the type of data and the granularity of control required.
In Fleming’s experience, API access is only valuable in situations where the data is well understood. But he adds: “The vast majority of research doesn’t fall into this category.” Instead, it is more important to provide access to a dataset, he says.
The Francis Crick Institute decided to leave behind its on-premise data infrastructure and has used Snowflake’s Data Cloud to provide it with the precise access controls, management reporting, authentication, billing control and data-sharing capabilities it needed to host its data safely.
Explaining the change, Fleming says: “In scientific research, you used to have to log into a file store. Snowflake offers collaboration as a shared service, with real-time access control and workflow for auditing.”
Using Snowflake’s native capabilities, the Institute has been able to build a trusted research environment (TRE) architecture, which, according to Fleming, is rapidly configurable and deployable for each specific research project and supports consortium-based research across larger global networks.
He says that by using Snowflake, the IT function at the Francis Crick Institute can offer a more permeable environment to enable researchers to work within an ecosystem. “Developing the infrastructure for a complex global consortium used to be a one- to two-year undertaking,” he says. “Now a secure, auditable environment can be built in around 30 minutes.”
A repeatable framework for global researchers
In a paper published in August 2022 proposing a new approach to supporting collaborative research, Fleming noted that one of the fundamental administrative issues with managing a consortium is allocating and managing roles and responsibilities for handling, loading and transforming data.
Usually, there is no predefined framework for these roles, because they are largely dictated by whatever infrastructure the consortium is using. That infrastructure is in turn often distributed across the consortium members, who are trying to piece together a workable solution from whatever locally available components they can negotiate to use.
“Often, this therefore means relying on ‘best-efforts’ support from local administrators for different elements, leading to a lack of standardisation, and considerable complexity in ensuring that the terms of the research collaboration agreement and any ethics approvals are followed correctly,” said Fleming in the paper.
Rather than developing a traditional trusted research environment, the team wanted to build something that was repeatable and could be rolled out for different research groups and research initiatives, says Fleming.
The Data and Analytics Research Environments UK (DARE UK) consortium led by the Francis Crick Institute, along with BT, the Institute of Cancer Research and the Rosalind Franklin Institute, supported by technical partners Infinite Lambda and Snowflake, worked on developing such a platform. The proposed platform is designed to support the creation of secure trusted research environments “on demand”, built around the needs and restrictions of individual research projects.
The architecture is based on a set of core components: Snowflake for the data sources, Apache Airflow for workflow, DBeaver as the data extraction, transformation and load component, Okta for user authentication and ServiceNow for additional controls.
Discussing the process by which data access is provided to a research consortium, Fleming says: “A bunch of scientists form a consortium to ask a common research question. There is a formal agreement to handle intellectual property and briefings on what they want to answer.”
The outputs of this then go to research bodies such as hospital trusts, which assess the project’s validity and ethics, and approve grants. At the point where the consortium members require consistent patient data, a series of questionnaires breaks down their requirements, says Fleming. The answers feed into a script that configures Snowflake to provide the required level of data protection, for instance if HIPAA compliance has been stipulated.
Fleming says four clinical research projects are using the DARE UK framework, covering cancer, Parkinson’s disease, hepatitis B and schizophrenia. The projects are very different in structure – two have a complex international dimension, while two are “domestic” with UK collaborators only. The first project went live in late September 2022.