The quiz - Are you using the data science tool kit for open science?

How we marked the quiz and additional resources

Take the quiz here, if you haven’t already, and then come back.

Explanation of how we marked the quiz

Have suggestions or thoughts? Send your comments through issues. Source code for this webpost here.

Additional resources below.

1) Open software

There are many data science software programs, each with its own strengths. Data scientists typically use multiple programs, but open software is the de facto way to share code with others.

The two leading open-source programming languages for data science are R and Python. Julia is also open source and is being adopted rapidly. Thousands of code packages for R and Python are publicly available in code repositories. These code repositories have well-established guidelines for how to write and document code for sharing. A rich tool kit of supporting libraries is available to make it easy for you to adopt these best practices, including creating documentation, cleaning and linting your code, and checking and testing for errors.

Tip: Take an introductory course for an open-source programming language.

Many introductory courses are free.

2) Coding style

Code is easier to read if it is written in a consistent style that covers how to name variables and functions and how to use spaces. There are common style guides for different languages and within institutions. For example, Google publishes its style guides for many major programming languages. Try to use the same style guide as your organization or colleagues.

Tip: Talk with your colleagues about adopting a coding style.
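As a minimal sketch of what a consistent style looks like in practice, the Python snippet below follows PEP 8, the community style guide for Python. PEP 8 is our choice for illustration only, and the function and variable names are hypothetical.

```python
# Before: terse names and no spacing make the intent hard to follow.
# def BMI(w,H): return w/H**2

# After: PEP 8 style, with descriptive snake_case names,
# spaces around operators and a short docstring.
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


bmi = body_mass_index(weight_kg=70.0, height_m=1.75)
print(round(bmi, 1))  # prints 22.9
```

Both versions compute the same number. The second is easier for a colleague to read, review and reuse.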

3) Documenting your code

Documenting your analysis code is more than adding comments within the program code to explain the steps or sections. See the blog post by Daniele Procida for a description of the four components of good documentation: tutorials, how-to guides, explanation and technical reference.

“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.”

Daniele Procida. What nobody tells you about documentation

Tip: Have a coding buddy. Use pull requests on a Git hosting service such as GitHub or GitLab to review each other’s code.

Tip: Think of documenting code like explaining to someone how to bake a cake.
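Of the four components described above, technical reference is the one that lives closest to the code. The sketch below shows one common way to write it in Python: a docstring that states what a function expects, what it returns, and a worked example. The function `age_group` and its cut-offs are hypothetical and chosen only for illustration.

```python
import doctest


def age_group(age_years: int) -> str:
    """Classify an age in years into a broad age group.

    Parameters
    ----------
    age_years : int
        Age in completed years. Must be zero or greater.

    Returns
    -------
    str
        "child" (under 18), "adult" (18 to 64) or "older adult" (65 and over).

    Examples
    --------
    >>> age_group(42)
    'adult'
    >>> age_group(70)
    'older adult'
    """
    if age_years < 0:
        raise ValueError("age_years must be zero or greater")
    if age_years < 18:
        return "child"
    if age_years < 65:
        return "adult"
    return "older adult"


# Run the examples in the docstring so the documentation cannot
# silently drift away from what the code actually does.
doctest.testmod()
```

Like a recipe, the docstring lists the ingredients (parameters), the result (return value) and a worked example your reader can try.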

4) Code notebooks

Code notebooks like Jupyter and R Markdown are a great way to document and share code. They combine code snippets with text in a document that is easy to read. The code snippets can be executed by your reader – and modified by them if they wish to explore how the code works. The notebooks are easily shared on the web.

We like R Markdown because you can create not only notebooks but also online books, blogs, [code package documentation](https://pkgdown.r-lib.org/) and other documents.

5) Check your code

Errors in code are inevitable. The more code you write, the more errors you’ll have in your code. Remember, your data science projects are becoming increasingly complex, with more and larger data and more collaborators.

Fortunately, there are tools to help you reduce the number of errors you write. In our teams, we check each other’s code if we plan to reuse it more than once or share it with others. We use linters, checkers and tests. As a back-up, we run these tools automatically using continuous integration before code is shared with others.

Tip: Set up a code linter.
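A linter (for Python, tools such as flake8 or pylint) flags style problems and simple mistakes without running your code. Tests go one step further and check that the code does what you expect. The sketch below uses Python’s built-in `unittest` module. The function `mean_ignore_missing` is a hypothetical example, not code from our projects.

```python
import unittest


def mean_ignore_missing(values):
    """Return the mean of the values that are not missing (not None)."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("no non-missing values")
    return sum(present) / len(present)


class TestMeanIgnoreMissing(unittest.TestCase):
    def test_skips_missing_values(self):
        # The None is dropped, so the mean is (1 + 3) / 2.
        self.assertEqual(mean_ignore_missing([1, None, 3]), 2)

    def test_all_missing_raises(self):
        # With nothing to average, the function should fail loudly.
        with self.assertRaises(ValueError):
            mean_ignore_missing([None, None])


# Load and run the tests explicitly so the script keeps running afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMeanIgnoreMissing)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A continuous integration service can run the same tests every time code is pushed, so a failing test is caught before the code reaches collaborators.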

6 to 8) Git and Git repositories

Git and Git repositories have the most points in our quiz. When we talk about Git, terms like “linchpin” and “glue” come up. Many parts of the open science tool kit would look considerably less developed without Git. There are other version control systems, but Git is by far the most widely used for new projects.

Git is difficult to learn, but everyone who learns it is glad they did. Using Git will save you time and heartache – the time you invest will be returned to you many times over.

Git hosting services like GitHub and GitLab are how data scientists collaborate across the world.

There are many guides and tips for using Git and Git repositories. Keep your eyes open for the increasing number of resources for people like you (hint: many resources are for software developers in business).

We have created some resources for health researchers. Getting started with Git.

Tip: If you don’t (yet) have a Git repository at work, try it at home. There are even many uses for Git beyond programming – make it fun.

9) Metadata

You may have been surprised to see a question about metadata in the quiz. This is a rapidly developing area that data scientists are paying more attention to. Data without context is meaningless – metadata provides that context. Metadata tells you about your data: where it came from (data provenance) and what it contains.

Fortunately, many sources of data come with standardized metadata. For example, there are over ten thousand databases with 5 million variables available at ICPSR, all with metadata encoded using the Data Documentation Initiative ([DDI](https://www.ddialliance.org)). Aim to publish your results with metadata to allow machine-actionable uses of your research, in addition to ensuring reliable reproducibility and transparency. For example, we publish our algorithms using Predictive Model Markup Language ([PMML](http://dmg.org)).

A challenge is a lack of well-developed tools to use metadata in your project (beyond variable and category labels). We’ve created an R library that helps use and maintain metadata.

Tip: Add titles for tables that you share with others – in the same way you already add titles to figures and plots. Yes! Titles are metadata. You wouldn’t make a plot without a title, so why share tables without titles?
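To make the idea concrete, here is a minimal sketch of a table that carries its own metadata: a title, a source, and variable and category labels. It uses plain Python and JSON rather than DDI or PMML, and the survey, variable names and counts are all invented for illustration.

```python
import json

# Hypothetical table: rows of data plus the metadata that gives them context.
table = {
    "title": "Table 1. Number of respondents by smoking status",
    "source": "Hypothetical health survey, 2018",  # data provenance
    "variables": {
        "smoking_status": {
            "label": "Smoking status",
            "categories": {"1": "Current", "2": "Former", "3": "Never"},
        },
        "n": {"label": "Number of respondents"},
    },
    "rows": [
        {"smoking_status": 1, "n": 120},
        {"smoking_status": 2, "n": 200},
        {"smoking_status": 3, "n": 680},
    ],
}


def labelled_rows(tbl):
    """Replace variable names and category codes with their labels."""
    out = []
    for row in tbl["rows"]:
        labelled = {}
        for name, value in row.items():
            meta = tbl["variables"][name]
            categories = meta.get("categories", {})
            # Fall back to the raw value when no category label exists.
            labelled[meta["label"]] = categories.get(str(value), value)
        out.append(labelled)
    return out


print(table["title"])
for row in labelled_rows(table):
    print(row)

# Stored as JSON, the metadata can travel with the data and be read by machines.
metadata_json = json.dumps(
    {key: value for key, value in table.items() if key != "rows"}, indent=2
)
```

Without the metadata, a reader sees only the codes 1, 2 and 3. With it, both people and programs can tell what each row means.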

Imperatives

Internationally, there is a growing voice of concern about research reproducibility.

“Academic institutions can and must do better. We should be taking multiple approaches to make science more reliable.”

Jeffrey Flier. Dean of Medicine, Harvard University. Nature 549, 133 (2017)

“Put simply, this means that researchers should make their computational workflow and data available for others to view. They should include the code used to generate published figures and omit only data that cannot be released for privacy or legal reasons.”

Jeffrey M. Perkel. A toolkit for data transparency takes shape. Nature 560, 513-515 (2018)

“More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.”

Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature 533, 452-4 (2016)

References

General

  1. Donoho D. 50 years of Data Science. Sept. 18, 2015
  2. Stukel TA, Austin PC, Azimaee M, Bronskill SE, Guttmann A, Paterson JM, Schull MJ, Sutradhar R, Victor JC. Envisioning a Data Science Strategy for ICES. Toronto, ON: Institute for Clinical Evaluative Sciences; 2017. ISBN: 978-1-926850-77-1
  3. Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nature reviews Cardiology. 2016;13(6):350-9.
  4. Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, et al. Best practices for scientific computing. PLoS Biol. 2014;12(1):e1001745.
  5. Hicks SC, Irizarry RA. A Guide to Teaching Data Science. The American Statistician. 2017;72(4):382-91. 10.1080/00031305.2017.1356747

Open Science

  1. Flier, J. (2017). Faculty promotion must assess reproducibility. Nature, 549(7671), 133. doi:10.1038/549133
  2. Perkel, J. M. (2018). A toolkit for data transparency takes shape. Nature, 560, 513-515.
  3. Baker, M. 1,500 scientists lift the lid on reproducibility. [Nature 533, 452-4 (2016)](https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970).
  4. Woelfle, M.; Olliaro, P.; Todd, M. H. (2011). Open science is a research accelerator. Nature Chemistry. 3: 745–748. doi:10.1038/nchem.1149
  5. Stodden, V., McNutt, M., Bailey, D. H., Deelman, E., Gil, Y., Hanson, B., . . . Taufer, M. (2016). Enhancing reproducibility for computational methods. Science, 354(6317), 1240-1241. doi:10.1126/science.aah6168
  6. Kopf D. This year’s Nobel Prize in economics was awarded to a Python convert. qz.com Oct 2018.
  7. Somers J. The Scientific Paper Is Obsolete: Here’s what’s next. The Atlantic Apr 2018.
  8. Kitzes J, Turek D, Deniz F. The practice of reproducible research: case studies and lessons from the data-intensive sciences. Univ of California Press; 2017.
  9. Pioneering ‘live-code’ article allows scientists to play with each other’s results. Nature

Git and version control

  1. 4 Reasons Why Beginning Programmers Should Use “Git”. Medium, Bouasavanh H, Jan 2018. Accessed May 2019.
  2. Perez-Riverol Y, Gatto L, Wang R, Sachsenberg T, Uszkoreit J, Leprevost Fda V, et al. Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol. 2016;12(7):e1004947.
  3. Git/GitHub guide
  4. Version control with Git
  5. Git and GitHub learning resources
  6. Integration of GitHub with SAS
  7. GitKraken (the Git client our team uses)

Code documentation

  1. What nobody tells you about documentation. Divio Blog. Accessed Nov 2018
  2. Jupyter Notebooks
  3. Why Jupyter is data scientists’ computational notebook of choice
  4. Introduction to R Markdown
  5. R Markdown: The definitive guide
  6. R Markdown cheat sheet
  7. Advantages to using R Markdown for data analysis over Jupyter Notebooks

Programming

  1. Population Health Data Science with R. Tomás J Aragón
  2. R for Data Science. G Grolemund and H Wickham
  3. Efficient R programming. C Gillespie, R Lovelace
  4. R for Data Science, Chapter 19: Functions. G Grolemund, H Wickham

Metadata

  1. IBM developerWorks. What is PMML? Accessed 2018.