Data Science Engineer at Harvard University

A joint project between the SBGrid Consortium at Harvard Medical School and the Dataverse Team at the Institute for Quantitative Social Science at Harvard University has an immediate opening for a developer to help us build a next-generation data publication system for large biomedical datasets.

We aim to make biomedical datasets publicly available through a federated data grid to facilitate access, citation, and data analysis by scientists. Our pilot collection includes datasets generated using X-ray crystallography, computer modeling, lattice light sheet microscopy, and microED diffraction. This collection is currently replicated to computing centers in the US, Europe, Asia, and South America. The project is supported by the Helmsley Charitable Trust and was recently selected as a pilot of the U.S. National Data Service. To learn more about the environment, please visit our current implementation at and our group websites at,, and

The data science engineer will be embedded within the Dataverse development team and will focus primarily on implementing the features necessary for the successful completion of this project. Features that must be added to Dataverse include APIs for interoperation with components that handle large (~100 GB) datasets, automatic data validation pipelines, custom publishing workflows, and other features relevant to specific biomedical data types. All new functionality developed under this project will be merged into the Dataverse open source project and shared with the community.

As a member of our team, this person can expect to collaborate with researchers and collection specialists and to present outcomes of the project at meetings and conferences.

An advanced degree (computer science, bioinformatics, or engineering preferred) and 3-5 years of strong programming experience, preferably in Java and Python and ideally in the context of web applications, are strongly preferred.

Our team welcomes candidates with diverse technical backgrounds, but the successful candidate will have experience handling large datasets and working as part of an agile software development team. A working knowledge of Linux, shell scripting, databases, and distributed version control systems (Git, Mercurial, etc.) is also necessary. The ideal candidate will also be familiar with data management software and with the analysis of large datasets.

This is a term appointment ending on September 30, 2018. To apply, email

