Rare disease genetics has been revolutionized by the widespread use of exome sequencing. However, the interpretation of sequence data on a large scale raises two major issues. The first is technical: short read sequencing has limits, and the associated tools can generate artifacts. The second is the difficulty of interpreting rare variants without knowledge of the genetic variation in large cohorts of individuals.
These issues can be addressed to a large extent by sharing clinical sequence data across multiple groups. Sharing allows the identification of shared traits and variants, and the verification of hypotheses by comparing clinical data. Another advantage is the improvement in calling accuracy when the same variants are detected across multiple samples.
To this end, we have assembled the UCL-exomes (or UCLex) consortium. UCLex groups together a dozen clinical investigators (most, but not all, at UCL) with an analytical lead based at the UGI. Rules are in place to facilitate data sharing while respecting patient privacy. As of now, UCLex consists of 6053 exome samples, and this set grows regularly. Data are stored on the UCL computer science cluster, and analysis, including a monthly joint calling of the entire set, is centralized at the UCL Genetics Consortium. At an average size of 10 Gb per exome, these sequence data add up to 60.53 Tb of required storage. Joint analysis at this scale creates significant computational challenges, and technical solutions (hardware/storage) are being developed to accommodate it as the sample size grows. A key tool is the reduced reads format developed by the GATK team.
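The storage figure quoted above follows from simple arithmetic, and a minimal sketch of it is given here so the estimate can be redone as the sample count changes. The function name `total_storage_tb` is hypothetical, and the sketch assumes decimal units (1000 of the smaller unit per larger unit), which is what the quoted 60.53 figure implies.

```python
def total_storage_tb(n_exomes: int, avg_size_gb: float) -> float:
    """Estimate total storage in tera-units from a per-exome size in giga-units.

    Assumes decimal units (1 tera = 1000 giga); this is an assumption made
    for illustration, chosen because it reproduces the figure in the text.
    """
    return n_exomes * avg_size_gb / 1000


# Figures taken from the text: 6053 exomes at an average of 10 per exome.
total = total_storage_tb(6053, 10.0)
print(f"Estimated storage: {total:.2f}")
```

Re-running the function with an updated sample count gives a rough figure for planning purposes, on the assumption that the average per-exome size stays about the same.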
UCLex is a collaboration between clinical groups and analysts to facilitate the analysis of the large amount of exome sequence data being generated. The difficulty is to strike a balance between data confidentiality and data ownership on the one hand, and, on the other, the advantages of sharing data across multiple groups: better variant calling, case control analysis and, more generally, the dissection of both complex and Mendelian disorders.
Clinical PIs are typically the “owners” of the exome sequence data, and analysts at UCL provide calls/association tests back to the clinical groups. The general rules of engagement are listed below: