How to review a data analysis

18 Jul, 2016

Writing good code is hard; writing a good analysis is harder. Peer review is an essential tool for catching recurring errors and omissions and, more generally, for spreading knowledge. I have found a checklist invaluable for remembering the most important things to watch out for during a review. It’s far too easy to focus on a few details and ignore others, which might (or might not) be caught in a subsequent round.

I don’t religiously apply every bullet point of the following checklist to every analysis, nor is this list complete; more items would have to be added depending on the language, framework, libraries, models, etc. used.

Is the question the analysis should answer clearly stated?
Is the best/fastest dataset that can answer the question being used?
Do the variables used measure the right thing (e.g. submission date vs activity date)?
Is a representative sample being used?
Are all data inputs checked (for the correct type, length, format, and range) and encoded?
Do outliers need to be filtered or treated differently?
Is seasonality being accounted for?
Is sufficient data being used to answer the question?
Are comparisons performed with hypothesis tests?
Are estimates bounded with confidence intervals?
Should the results be normalized?
If any statistical method is being used, are the assumptions of the model met?
Is correlation confused with causation?
Does each plot communicate an important piece of information or address a question of interest?
Are legends and axes labelled, and do the axes start from 0?
Is the analysis easily reproducible?
Does the code work, i.e. does it perform its intended function?
Is there a more efficient way to solve the problem, assuming performance matters?
Does the code read like prose?
Does the code conform to the agreed coding conventions?
Is there any redundant or duplicate code?
Is the code as modular as possible?
Can any global variables be replaced?
Is there any commented out code and can it be removed?
Is logging missing?
Can any of the code be replaced with library functions?
Can any debugging code be removed?
Where third-party utilities are used, are returned errors being caught?
Is every public API documented?
Is any unusual behavior or edge-case handling described?
Is there any incomplete code? If so, should it be removed or flagged with a suitable marker like ‘TODO’?
Is the code easily testable?
Do tests exist and do they actually test that the code is performing the intended functionality?
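
To make the item on confidence intervals a little more concrete, here is a minimal sketch of one way to bound an estimate: a percentile bootstrap interval for the difference between two group means. It is only an illustration, not a recommendation of this method over others. The function name `bootstrap_mean_diff_ci`, its parameters, and the two small samples are all made up for the example, and it uses nothing beyond the Python standard library.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b).

    Hypothetical helper written for this example only.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements for two groups, purely for illustration.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
treatment = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2, 12.6, 13.3]

low, high = bootstrap_mean_diff_ci(treatment, control)
print(f"95% CI for the difference in means: [{low:.2f}, {high:.2f}]")
```

If the interval excludes 0, the data are at least consistent with a real difference between the groups. A reviewer would still want to check the other items above, such as sample size and representativeness, before trusting it.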

#Statistics #Data Analysis