OpenData4Health
What follows is a long read, as it summarizes the current output of the project.
To save time, go to the website and follow the instructions!
1. The project started as CancerRisks=f(geographic factors) in France, based on open data, but...
It rapidly became clear that French open data on cancer risks is not granular enough. Here is, for example, a spurious correlation obtained when doing BreastCancerMortality=f(Radon) at department level (code in Python):
With individual data, we might still see the same type of relationship, but we could run a multivariate analysis that takes into account important factors such as smoking status, obesity and social status, and see that radon is a small added risk (one doesn't expect it to be a beneficial factor! 😄). But with open data at French department level we can't, because cancer risks are available at too few geographic points: only as many as there are departments, which is not enough.
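To make the confounding argument concrete, here is a minimal sketch on purely synthetic data (not the project's data, and not the project's code). It assumes a hypothetical world where radon adds a small risk, smoking adds a large one, and radon exposure happens to be lower where smoking is higher. All names (`fit_ols`, `smoking`, `radon`, `risk`) and all coefficients are invented for illustration.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients [intercept, slope_1, ...] for y ~ X."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n = 5000  # synthetic individuals

# Assumption: smoking is the dominant risk factor (standardized units).
smoking = rng.normal(size=n)
# Assumption: radon exposure is negatively associated with smoking.
radon = -0.8 * smoking + rng.normal(scale=0.6, size=n)
# Assumption: the true radon effect is small and harmful (+0.1).
risk = 1.0 * smoking + 0.1 * radon + rng.normal(scale=0.5, size=n)

# Univariate fit: radon absorbs the effect of the omitted smoking variable.
naive = fit_ols(radon[:, None], risk)
# Multivariate fit: adjusting for smoking recovers a small positive effect.
adjusted = fit_ols(np.column_stack([radon, smoking]), risk)

print("radon slope, radon only:         ", round(float(naive[1]), 3))
print("radon slope, adjusted for smoking:", round(float(adjusted[1]), 3))
```

In this toy setup the univariate slope comes out negative (radon looks protective) while the adjusted slope is close to the small positive value that was built in. The adjustment only works when there are enough observations to estimate several coefficients at once, which is exactly what department-level data lacks.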
Of note, cancer mortality is a better indicator for risk factors than cancer incidence or survival. This is because better cancer screening can artificially increase incidence, which reflects medical diagnosis rather than biological onset. Since our concern is overall health, mortality appeared to be the best indicator of the three.
2. A switch to all-cause mortality
Switch! In France, every death is individually reported, along with the place of birth and of death, the gender and the precise age. So, for all-cause mortality (not cancer mortality), it becomes possible to do Y=f(X) at a much more granular level. There are biases that prevent us from using the finest granularity (for example, most people die in hospital and not at home), but we can still work on a much finer grid than departments, and therefore do multivariate analyses. (We started by collecting deaths and computing death statistics by department with R.)
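As an illustration of what individual death records make possible, here is a minimal sketch that counts deaths per area, gender and age band. The record layout, the area codes "A" and "B", and the helper names `age_band` and `count_deaths` are all hypothetical; the real files have their own format, and the project's own aggregation was done in R.

```python
from collections import Counter

# Made-up records for illustration: (area_of_death, gender, age_at_death).
records = [
    ("A", "F", 84),
    ("A", "F", 81),
    ("A", "M", 67),
    ("B", "M", 92),
    ("B", "F", 45),
]


def age_band(age, width=10):
    """Map an exact age to a band label such as '80-89'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def count_deaths(rows):
    """Count deaths per (area, gender, age band)."""
    counts = Counter()
    for area, gender, age in rows:
        counts[(area, gender, age_band(age))] += 1
    return counts


deaths = count_deaths(records)
for key in sorted(deaths):
    print(key, deaths[key])
```

Counts of this shape, one per area and demographic group, are the Y in Y=f(X); the choice of area size is where the hospital-versus-home bias mentioned above has to be traded off against granularity.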
After much effort, we could create maps of over- and under-mortality in France that make sense. Apart from finding and assembling the right data, a blocking step was to highlight places with statistically high or low mortality.
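One common way to flag such places, shown here only as a sketch and not necessarily the method used in this project, is to compare the observed number of deaths in an area with the number expected from reference age-specific rates (a standardized mortality ratio, SMR), and to test the difference with an exact Poisson tail probability. The function names, the 5% threshold and the example inputs are assumptions.

```python
import math


def expected_deaths(population_by_band, reference_rate_by_band):
    """Deaths expected if the area had the reference age-specific rates."""
    return sum(
        population_by_band[band] * reference_rate_by_band[band]
        for band in population_by_band
    )


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu); returns 0.0 when k < 0."""
    total = sum(
        math.exp(-mu + i * math.log(mu) - math.lgamma(i + 1))
        for i in range(k + 1)
    )
    return min(1.0, total)


def flag_area(observed, expected, alpha=0.05):
    """Return (SMR, label) using a two-sided exact Poisson test."""
    smr = observed / expected
    p_high = 1.0 - poisson_cdf(observed - 1, expected)  # P(X >= observed)
    p_low = poisson_cdf(observed, expected)             # P(X <= observed)
    if p_high < alpha / 2:
        return smr, "high"
    if p_low < alpha / 2:
        return smr, "low"
    return smr, "not significant"


# Hypothetical area: 130 deaths observed where 100 were expected.
print(flag_area(130, 100.0))
```

When many areas are tested at once, some will look extreme by chance alone, so a real map would need a correction for multiple testing or some smoothing across neighbouring areas.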