How we marked the quiz and additional resources
Take the quiz here, if you haven’t already, and then come back.
Have suggestions or thoughts? Send us your comments through issues. The source code for this web post is here.
Additional resources below.
There are many data science software programs, each with its own strengths. Data scientists typically use multiple programs, but open-source software is the de facto way to share code with others.
The two leading software packages for data science are R and Python. Julia is also open source and is being adopted rapidly. Many thousands of code packages are publicly available in code repositories for R and Python. These code repositories have well-established guidelines for how to write and document code for sharing. A rich tool kit of supporting libraries is available to make it easy for you to adopt these best practices, including creating documentation, cleaning and linting your code, and checking and testing for errors.
Tip: Take an introductory course in an open-source programming language.
Many introductory courses are free.
Code is easier to read when it is written in a consistent style: how variables and functions are named, how spaces are used, and so on. There are common style guides for different languages, and some institutions publish their own. For example, Google publishes style guides for all major programming languages. Try to use the same style guide as your organization or colleagues.
Tip: Talk with your colleagues about adopting a coding style.
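Tools can even apply a style guide for you. Here is a minimal sketch using the styler package, which reformats R code to follow the tidyverse style guide (the file name is hypothetical):

```r
# Install once: install.packages("styler")
library(styler)

# Restyle an entire script in place ("analysis.R" is a hypothetical name).
style_file("analysis.R")

# Or restyle a snippet of code directly:
style_text("my_sum<-function(x,y){x+y}")
#> my_sum <- function(x, y) {
#>   x + y
#> }
```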
Documenting your analysis code is more than adding comments within the program code to explain the steps or sections. See the blog post by Daniele Procida for a description of the four components of good documentation: tutorials, how-to guides, explanation and technical reference.
“It doesn’t matter how good your software is, because if the documentation is not good enough, people will not use it.”
Daniele Procida. What nobody tells you about documentation
Tip: Have a coding buddy. Open pull requests on GitHub or GitLab to review each other’s code.
Tip: Think of documenting code like explaining to someone how to bake a cake.
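In R, for example, reference documentation is written right beside the function it describes. A minimal sketch using roxygen2 comments (the function and its name are hypothetical):

```r
# roxygen2 comments (#') sit directly above the function they document;
# devtools::document() turns them into help pages in an R package.

#' Convert weight from pounds to kilograms
#'
#' @param pounds A numeric vector of weights in pounds.
#' @return A numeric vector of weights in kilograms.
#' @examples
#' pounds_to_kg(150)
pounds_to_kg <- function(pounds) {
  pounds * 0.45359237
}
```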
Code notebooks like Jupyter and R Markdown are a great way to document and share code. They combine code snippets and text in a document that is easy to read. The code snippets can be executed by your readers, and modified if they wish to explore how the code works. The notebooks are easily shared on the web.
We like R Markdown because you can create not only notebooks but also online books, blogs, [code package documentation](https://pkgdown.r-lib.org/) and other documents.
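An R Markdown document is just a text file mixing a metadata header, prose and executable code chunks. A minimal sketch (the title and chunk contents are placeholders):

````markdown
---
title: "A minimal R Markdown example"
output: html_document
---

Plots and prose live side by side with the code that made them.

```{r summary-stats}
# Summarize a built-in dataset
summary(mtcars$mpg)
```
````

Rendering it with `rmarkdown::render("example.Rmd")` produces an HTML page that interleaves the prose, the code and its output.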
Errors in code are inevitable. The more code you write, the more errors your code will contain. Remember, your data science projects are becoming increasingly complex, with more and larger data and more collaborators.
Fortunately, there are tools to help you reduce the number of errors you write. In our teams, we check each other’s code if we plan to reuse it more than once or share it with others. We use linters, checkers and tests. As a back-up, we run these tools in a continuous integration pipeline before code is shared with others.
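As an example of tests, here is a minimal sketch using the testthat package, checking the hypothetical pounds_to_kg() function from the documentation example above:

```r
library(testthat)

# A tiny unit test; testthat reports exactly which expectations fail.
test_that("pounds_to_kg converts correctly", {
  expect_equal(pounds_to_kg(0), 0)
  expect_equal(pounds_to_kg(1), 0.45359237)
  expect_equal(pounds_to_kg(c(100, 200)), c(45.359237, 90.718474))
})
```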
Tip: Set up a code linter.
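A minimal sketch using the lintr package (the file name is hypothetical):

```r
# Install once: install.packages("lintr")
library(lintr)

# Report style issues and suspicious constructs in a script.
lint("analysis.R")

# Or lint every file in a project at once:
lint_dir(".")
```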
Git and Git repositories earned the most points in our quiz. When we talk about Git, terms like “linchpin” and “glue” come up. Many parts of the open science tool kit would look considerably less developed without Git. There are other version control systems, but Git is effectively the only system used for new projects.
Git is difficult to learn, but everyone who learns it is glad they did. Using Git will save you time and heartache; any investment of your time will be repaid many, many fold.
Repository hosting services like GitHub and GitLab are how data scientists collaborate across the world.
There are many guides and tips for using Git and Git repositories. Keep your eyes open for the increasing number of resources for people like you (hint: many resources are written for software developers in business).
We have created some resources for health researchers: Getting started with Git.
Tip: If you don’t (yet) have a Git repository at work, try it at home. There are many uses for Git beyond programming. Make it fun.
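If you work in R, you can even get started without leaving it. A minimal sketch using the usethis package (run inside a project directory; use_github() assumes you have a GitHub personal access token configured):

```r
library(usethis)

# Initialize a Git repository in the current project and make a first commit.
use_git()

# Create a matching repository on GitHub and push the project to it.
# Requires a GitHub personal access token; see gitcreds::gitcreds_set().
use_github()
```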
You may have been surprised to see a question about metadata on the quiz. This is a rapidly developing area that data scientists are paying more attention to. Data without context is meaningless; metadata provides that context. Metadata tells you about your data: where it came from (data provenance) and what it contains.
Fortunately, many sources of data come with standardized metadata. For example, there are over ten thousand databases with 5 million variables available at ICPSR, all with metadata encoded using the Data Documentation Initiative ([DDI](https://www.ddialliance.org)). Aim to publish your results with metadata to allow machine-actionable uses of your research, in addition to ensuring reliable reproducibility and transparency. For example, we publish our algorithms using the Predictive Model Markup Language ([PMML](http://dmg.org)).
A challenge is the lack of well-developed tools for using metadata in your project (beyond variable and category labels). We’ve created an R library that helps use and maintain metadata.
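Even before adopting specialized tooling, you can attach simple metadata directly to your data in R. A minimal sketch using base R attributes (the dataset and labels are hypothetical):

```r
# A small hypothetical dataset
survey <- data.frame(age = c(34, 51), bmi = c(22.5, 27.1))

# Attach variable labels as attributes - a simple form of metadata.
attr(survey$age, "label") <- "Age at interview (years)"
attr(survey$bmi, "label") <- "Body mass index (kg/m^2)"

# Record provenance for the whole dataset.
attr(survey, "source") <- "Hypothetical survey, wave 1"

# Retrieve a label later.
attr(survey$age, "label")
#> [1] "Age at interview (years)"
```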
Tip: Add titles to tables that you share with others, in the same way you already add titles to figures and plots. Yes, titles are metadata. You wouldn’t make a plot without a title, so why share tables without titles?
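In R Markdown reports, for example, knitr::kable() accepts a caption argument, so the title travels with the table:

```r
library(knitr)

# The caption argument becomes the table title in the rendered document.
kable(head(mtcars[, c("mpg", "cyl", "wt")]),
      caption = "Table 1. Fuel economy, cylinders and weight for six cars")
```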
“Academic institutions can and must do better. We should be taking multiple approaches to make science more reliable.”
Jeffrey Flier. Dean of Medicine, Harvard University. Nature 549, 133 (2017)
“Put simply, this means that researchers should make their computational workflow and data available for others to view. They should include the code used to generate published figures and omit only data that cannot be released for privacy or legal reasons.”
Jeffrey M. Perkel. A toolkit for data transparency takes shape. Nature 560, 513-515 (2018)
“More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments.”
Monya Baker. 1,500 scientists lift the lid on reproducibility. Nature 533, 452-4 (2016)