Data Scientists Should Care About DevOps
Jun 4, 2017 · 481 words · 3 minutes read
Many data scientists I know come from an academic background: they went to college and majored in statistics, math, or a research-heavy field. Some have master’s degrees in statistics or PhDs in psychology or economics. Most of them don’t work in software development, but a lot of them care about reproducibility.
Reproducibility refers to the ability to rerun an analysis and get the same results. It’s a big topic in psychology and a lot of other fields right now. Unfortunately, lots of good researchers can’t reproduce their own analyses and don’t provide enough information for other people to reproduce them.
For research, you should make it stupid simple to reproduce your analysis. DevOps is about automated testing, continuous delivery, and infrastructure. Treat your analysis like software deployment and your work will be far easier to reproduce. These are a few of my tips for better reproducibility.
Do it with code.
Analyze your data using code. Learning R, Python, Julia, Matlab, or another programming language isn’t that difficult. It’ll take a little bit of time but pay off tremendously. Menu-based statistical software is especially bad because it’s hard to write good instructions to reproduce your analysis.
Write good code.
Okay, this one is a little harder. Just because your code works doesn’t mean it’s good. Avoid hard-coding file paths or other information that might be different on someone else’s computer. I can’t count how many times someone has shared code with me that had hard-coded file paths or passwords! Your code should run on my computer, not just yours, and shouldn’t contain any sensitive information.
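As a minimal sketch in Python, paths and secrets can be read from the environment instead of being written into the script. The names `ANALYSIS_DATA_DIR`, `ANALYSIS_DB_PASSWORD`, and `survey.csv` are hypothetical and chosen only for illustration.

```python
import os
from pathlib import Path


def get_data_dir() -> Path:
    """Return the data directory from ANALYSIS_DATA_DIR, defaulting to ./data."""
    # A relative default keeps the script runnable on any machine.
    return Path(os.environ.get("ANALYSIS_DATA_DIR", "data"))


def get_db_password() -> str:
    """Read the password from the environment so it never lands in the code."""
    password = os.environ.get("ANALYSIS_DB_PASSWORD")
    if password is None:
        raise RuntimeError("Set ANALYSIS_DB_PASSWORD before running the analysis.")
    return password


if __name__ == "__main__":
    # Hypothetical input file; the path is built from parts, not hard-coded.
    input_file = get_data_dir() / "survey.csv"
    print(f"Reading data from {input_file}")
```

Anyone who receives the script can point it at their own copy of the data by setting one environment variable, and no password ever ends up in version control.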
Automate your output.
Use R Markdown (or something similar) to weave your code and report into a single document. Avoid copying and pasting output into Word documents. Nothing is worse than having to manually check and change every number in every table because your boss or a reviewer asked for a few ‘minor’ changes.
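The same idea can be sketched in plain Python rather than a literate-programming tool. The scores and the report wording here are made up for illustration; the point is that every number in the text is computed and filled in by code, so a change in the data changes the report automatically.

```python
from statistics import mean, stdev

# Made-up data for illustration only.
scores = [72, 85, 90, 66, 78]


def render_report(values):
    """Build the report text directly from the data; no numbers are typed by hand."""
    return (
        f"We analyzed {len(values)} participants. "
        f"The mean score was {mean(values):.1f} (SD = {stdev(values):.1f})."
    )


if __name__ == "__main__":
    print(render_report(scores))
```

If a reviewer asks you to drop a participant, you change the data and rerun the script; the sample size, mean, and standard deviation in the sentence all update together.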
Write simple tests.
After exploring and analyzing your data, you develop expectations. Make your expectations explicit. If your analysis or data changes in a way that violates those expectations, your tests should fail. Tests are a quick way to find out whether a tweak to the analysis invalidates the text in your report.
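A sketch of what such tests might look like in Python, using plain `assert` statements. The records, field names, and plausible ranges are hypothetical; substitute whatever your own report actually claims.

```python
# Made-up records standing in for a cleaned data set.
records = [
    {"id": 1, "age": 34, "score": 72},
    {"id": 2, "age": 29, "score": 85},
    {"id": 3, "age": 41, "score": 90},
]


def check_expectations(rows, expected_n):
    """Raise AssertionError if the data no longer matches what the report claims."""
    # The report states the sample size, so it must not change silently.
    assert len(rows) == expected_n, f"expected {expected_n} rows, got {len(rows)}"
    # Each participant should appear exactly once.
    ids = [row["id"] for row in rows]
    assert len(ids) == len(set(ids)), "duplicate participant ids"
    # Values outside plausible ranges usually signal a data-entry or merge error.
    for row in rows:
        assert 18 <= row["age"] <= 100, f"implausible age: {row['age']}"
        assert 0 <= row["score"] <= 100, f"score out of range: {row['score']}"


if __name__ == "__main__":
    check_expectations(records, expected_n=3)
    print("All expectations hold.")
```

Checks like these can later be moved into a test runner such as pytest, but even a handful of assertions at the top of a script will catch a surprising number of mistakes.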
Version control your code.
Put your code on GitHub, Bitbucket, or GitLab. Even if you’re the only person working on the project, version control helps you track changes. If you choose to open source your project, you might get some useful contributions or someone else might learn something from your code.
Consider containerization.
Nothing runs your code quite like your own machine. If your analysis is in a Docker container, it’s not only easy to share, but you can also be confident that anyone who uses it will be using the same environment. There is a bit of a learning curve to Docker, but it’s not so bad once you get the hang of it.