Best practices for peer-reviewing a data analysis

Writing good code is hard; writing a good analysis is harder. Peer review is an essential tool to fight recurring errors and omissions and, more generally, to spread knowledge. I have found a checklist invaluable for remembering the most important things to watch out for during a review. It’s far too easy to focus on a few details and ignore others, which might be caught (or not) in a later round.

I don’t religiously apply every bullet point of the following checklist to every analysis, nor is this list complete; more items would have to be added depending on the language, framework, libraries, models, etc. used.

  • Is the question the analysis should answer clearly stated?
  • Is the best/fastest dataset that can answer the question being used?
  • Do the variables used measure the right thing (e.g. submission date vs activity date)?
  • Is a representative sample being used?
  • Are all data inputs checked (for the correct type, length, format, and range) and encoded?
  • Do outliers need to be filtered or treated differently?
  • Is seasonality being accounted for?
  • Is sufficient data being used to answer the question?
  • Are comparisons performed with hypothesis tests?
  • Are estimates bounded with confidence intervals?
  • Should the results be normalized?
  • If any statistical method is being used, are the assumptions of the model met?
  • Is correlation confused with causation?
  • Does each plot communicate an important piece of information or address a question of interest?
  • Are legends and axes labelled, and do the axes start from 0?
  • Is the analysis easily reproducible?
  • Does the code work, i.e. does it perform its intended function?
  • Is there a more efficient way to solve the problem, assuming performance matters?
  • Does the code read like prose?
  • Does the code conform to the agreed coding conventions?
  • Is there any redundant or duplicate code?
  • Is the code as modular as possible?
  • Can any global variables be replaced?
  • Is there any commented out code and can it be removed?
  • Is logging missing?
  • Can any of the code be replaced with library functions?
  • Can any debugging code be removed?
  • Where third-party utilities are used, are returned errors being caught?
  • Is every public API commented?
  • Is any unusual behavior or edge-case handling described?
  • Is there any incomplete code? If so, should it be removed or flagged with a suitable marker like ‘TODO’?
  • Is the code easily testable?
  • Do tests exist and do they actually test that the code is performing the intended functionality?
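
To make two of the items above more concrete (checking data inputs for type and range, and bounding estimates with confidence intervals), here is a minimal Python sketch using only the standard library. The field names `activity_date` and `usage_hours`, the 0 to 24 range and both function names are hypothetical, chosen only for illustration; the percentile bootstrap is just one of several ways to obtain an interval and may not suit every dataset.

```python
import random
import statistics
from datetime import date


def validate_record(record):
    """Return a list of problems found in one input record (empty if none).

    The expected fields and the allowed range are assumptions made for
    this example, not part of any real schema.
    """
    problems = []
    day = record.get("activity_date")
    if not isinstance(day, date):
        problems.append("activity_date must be a datetime.date")
    hours = record.get("usage_hours")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(hours, bool) or not isinstance(hours, (int, float)):
        problems.append("usage_hours must be a number")
    elif not 0 <= hours <= 24:
        problems.append("usage_hours must lie in [0, 24]")
    return problems


def bootstrap_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper
```

A reviewer could run `validate_record` over every row before any aggregation and ask that reported means come with the interval returned by `bootstrap_ci` rather than a bare point estimate. Fixing the seed also helps with the reproducibility item on the list.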
