MPEG seminar on Genome Compression Standardization
A seminar on Genome Compression Standardization was held on 20 October 2015 during the 113th MPEG meeting in Geneva. The purpose of the seminar was twofold: to raise awareness of the need for new approaches to genome compression for the efficient handling of the increasing flood of sequencing data, and to collect requirements from stakeholders in the different fields interested in the acquisition and processing of genome data.
The main topics covered by the seminar presentations were:
- New approaches, tools and algorithms to compress genome sequence data
- Objectives and issues of quality scores compression and impact on downstream analysis applications
- New approaches to quality scores definition and processing
- Genome compression and genomic medicine applications
MPEG-G members and anyone interested were invited to join the seminar to learn more about genome data processing challenges and MPEG standardization activities in this area, to share opinions, and to work with MPEG towards the definition of standard technologies supporting improved storage, transport and new functionality for the processing of genomic information.
20th October (Tuesday), 2015
Talks: 2pm – 6pm,
Crowne Plaza Hotel in Geneva
Avenue Louis-Casai 75, 1216 Cointrin, Switzerland
(see MPEG website for further details)
Marco Mattavelli (EPFL), Joern Ostermann (TNT), Ioannis Xenarios (SIB-SwissProt)
The workshop featured the following oral presentations:
Ioannis Xenarios – Swiss Institute of Bioinformatics – SwissProt, CH
Need to prepare to get everybody sequenced in the future: from womb to tomb – MPEG the movie of your life
James Bonfield – Wellcome Trust Sanger Institute, UK
CRAM implementation & future directions
Guy Cochrane – European Bioinformatics Institute, UK
Sequence data compression in the wild
Noah Daniels – Massachusetts Institute of Technology, US
Entropy-scaling Search of Massive Biological Data
Lukasz Roguski – Centro Nacional de Análisis Genómico (CNAG), Centre for Genomic Regulation (CRG)/Barcelona Institute of Science and Technology (BIST), ES
Paolo Ribeca – Integrative Biology, The Pirbright Institute, Woking, UK
Flexible compressed storage of genomic information beyond file formats: our experience with CARGO
Daniel Greenfield – PetaGene, UK
Lossy compression of genomics datasets
Noah Daniels received his Ph.D. in computer science from Tufts University in 2013, supervised by Lenore Cowen. He is currently a postdoctoral researcher at MIT in Bonnie Berger’s group. His research interests focus on algorithms for biological data science, particularly compressive acceleration. He has also worked on algorithm development at several start-up companies in the fields of global trade and medical informatics.
Daniel Greenfield spent four years leading teams at startups in Silicon Valley architecting and building groundbreaking products in parallel computing and high performance networking, with a subsequent acquisition by NASDAQ-listed nVidia. He completed an MEng in Bioinformatics, followed by a PhD in Computer Science at the University of Cambridge as a Gates scholar. His PhD dissertation was awarded the 2011 BCS Distinguished Dissertation Prize for the top Computer Science dissertation in the UK. He is currently a director at Fonleap Ltd, working on storage optimisation technologies. Over the past year, he has worked on PetaGene, a project in collaboration with EMBL-EBI, exploring new approaches to improve genomics data storage, compression and analysis.
James Bonfield studied Computer Science at Warwick University and then started work in 1992 at the Medical Research Council’s Laboratory of Molecular Biology in Cambridge, working on the “Staden Package” for the assembly, editing and visualization of DNA sequences. Since 2003 he has worked at the Wellcome Trust Sanger Institute on similar work, and more recently has been involved in developing the C implementation of CRAM.
Ioannis is the Director of the Vital-IT Group in Lausanne as well as the Swiss-Prot Group in Geneva. He received a Ph.D. in immunology at the Ludwig Institute of Cancer Research and the Institute of Biochemistry. He worked on the development of the Database of Interacting Proteins (DIP) under the supervision of Prof. David Eisenberg at the University of California Los Angeles. He then became the head of Translational Bioinformatics at Serono (now Merck Serono), where his group developed computational methodologies in the areas of proteomics, microarrays and genetics. He is one of the principal investigators of the ENFIN project, which aims to provide methods for dynamical systems modeling. Ioannis Xenarios has been a UNIL full Professor ad personam, affiliated with the CIG, since August 2010.
Guy heads the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena), a platform for the management, sharing, integration and dissemination of sequence data. With a background in cancer research and 12 years of experience in bioinformatics services, he has driven numerous developments both within and beyond ENA, notably leading the development of next generation sequence data infrastructure and comprehensive submission, archiving and presentation services in the late 2000s. Current work of the team includes data management and hosting for the MAP project, in which nanopore sequence data are captured, validated, archived and presented amongst the Oxford Nanopore Technologies user community. As part of the EMBL-EBI team that initially explored reference-based compression for sequence data, GC led algorithm and software development work on the CRAM framework (http://www.ebi.ac.uk/ena/software/cram-toolkit), which is now supported by the major downstream analysis tools and sees extensive use. GC is an active player in the sequence informatics community, is involved deeply in standards development work, established and coordinated the popular ‘Wellcome Trust Next Generation Hinxton Retreat’ series of annual workshops (2008-2013) and was editor of the Nucleic Acids Research Databases Collection (2008-2011). He has broad experience of leadership of groups of people in the effective delivery of scientific content, services and technology across core programmes of EMBL-EBI and numerous externally funded activities.
Lukasz is a software engineer, passionate about solving problems related to the efficient processing of large volumes of data. He is currently conducting research at the National Centre for Genomic Analysis (CNAG) and Pompeu Fabra University (UPF) in Barcelona. His research interests focus on methods for High Throughput Sequencing data storage and compression.
Paolo Ribeca received his PhD in Theoretical Physics from the University of Paris Sud (Orsay). His research interests focus on the application of cutting-edge techniques in high-performance scientific computing, data analysis and statistics to open research problems in physics, mathematics and biology. Since the inception of high-throughput genome sequencing techniques he has specialized in algorithms for short-read processing, with emphasis on DNA/RNA alignment and assembly. He is the main architect of the GEM (Genomics Multi-tool) suite (see http://gemlibrary.sourceforge.net), which provides programs such as the GEM mapper and the GEM RNA-mapping pipeline.