OpenData4Health

We first intended to model CancerRisks=f(geographic factors) in France. But with open data, all-cause mortality data is much more granular: 35,000 places in France! So we analyze Mortality=f(X) and derive insights for better health, such as CancerRisks=f'(X).


What follows is a long read: it summarizes the current output of the project.

To save time, go to the website and follow the instructions!

1. The project started as CancerRisks=f(geographic factors) in France, based on open data, but...

It rapidly became clear that French open data on cancer risks is not granular enough. Here, for example, is a spurious correlation obtained from BreastCancerMortality=f(Radon) at department level (code in Python):

With individual data, we might still see the same type of relationship, but we could run a multidimensional analysis that accounts for important factors such as smoking status, obesity and social status, and see that radon is a small added risk (one doesn't expect it to be a positive factor! 😄). With open data at French department level we can't, because cancer risks are available at too few geographic points: only as many as there are departments, which is not enough.

Of note, cancer mortality is a better indicator of risk factors than cancer incidence or survival, because better cancer screening can artificially increase incidence (which reflects medical diagnosis, not biological onset). Since we are concerned with overall health, mortality appeared to be the best indicator of the three.
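To make the confounding argument concrete, here is a minimal sketch on purely synthetic numbers (not project data). It assumes a single confounder, smoking, that drives mortality and happens to vary together with radon across areas. The helper names `pearson` and `partial_correlation` are ours, and the sign and size of the effect are illustrative only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys once the linear effect of zs is removed."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Synthetic values for 8 imaginary areas (NOT real data).
# Mortality is built from smoking plus noise; radon plays no role in it.
smoking = [10, 12, 14, 16, 18, 20, 22, 24]
radon = [30, 50, 40, 60, 55, 75, 65, 85]
mortality = [21, 23, 27, 33, 37, 39, 43, 49]

raw = pearson(radon, mortality)
adjusted = partial_correlation(radon, mortality, smoking)
print(f"raw r = {raw:.2f}, adjusted for smoking r = {adjusted:.2f}")
```

The raw correlation between radon and mortality is strong, yet it vanishes once smoking is taken into account. With only about a hundred departments and several such confounders, there are too few points to make this kind of adjustment reliable.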

2. A switch to all-cause mortality

The switch: in France, every death is individually reported, along with the place of birth and of death, the gender and the precise age. So, for all-cause mortality (not cancer mortality), it becomes possible to do Y=f(X) at a much more granular level. Some biases prevent us from using the finest granularity (for example, most people die in hospital and not at home), but we can still work on a much finer grid than departments, and therefore run multivariate analyses. (We started collecting deaths and computing death statistics by department with R.)
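As an illustration of the first processing step, here is a minimal sketch that turns individual death records into counts per zone, gender and age band. The record layout (`zone`, `sex`, `age` keys) is hypothetical and is not the actual layout of the Insee file. The 10-year bands with an open-ended top band are likewise an assumption.

```python
from collections import Counter

def age_band(age, width=10, top=90):
    """Lower bound of the age band; ages above `top` share one open band."""
    return min(age // width * width, top)

def count_deaths(records):
    """Count deaths per (zone, sex, age band) from individual records."""
    counts = Counter()
    for rec in records:
        counts[(rec["zone"], rec["sex"], age_band(rec["age"]))] += 1
    return counts

# Hypothetical records, for illustration only.
records = [
    {"zone": "zone_1", "sex": "F", "age": 84},
    {"zone": "zone_1", "sex": "F", "age": 87},
    {"zone": "zone_1", "sex": "M", "age": 67},
    {"zone": "zone_2", "sex": "M", "age": 101},
]
deaths = count_deaths(records)
```

Choosing the `zone` key is where the hospital bias matters: a zone large enough to contain both the hospital and the homes of its patients is less distorted than a single commune.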

After much effort, we were able to create maps of excess and reduced mortality in France that make sense. Apart from finding and assembling the right data, a blocking step was highlighting the places with statistically high or low mortality.

Notice how similar the green and red areas are! Yet one map is based on 3,800 cantons and the other on 35,000 cities. This is thanks to the use of statistics: values that are not statistically significant are greyed out. Models doing Y=f(X) [mortality = f(geographic factors)] are then much better than our eyes, but at least these images give us the big picture.

Of note, these maps are adjusted for age and gender. The differences come from other factors.
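One common way to decide whether a place has statistically high or low mortality is to compare observed deaths with expected deaths and put a confidence interval around their ratio. The sketch below uses Byar's approximation to the Poisson interval. This is an assumption chosen for illustration, not necessarily the test used for the project maps, and the function names are ours.

```python
import math

def smr_interval(observed, expected, z=1.96):
    """Standardized mortality ratio with an approximate 95% interval (Byar)."""
    if expected <= 0:
        raise ValueError("expected deaths must be positive")
    smr = observed / expected
    if observed > 0:
        term = 1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))
        lower = max(observed * term ** 3 / expected, 0.0)
    else:
        lower = 0.0
    o1 = observed + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected
    return smr, lower, upper

def classify(observed, expected):
    """'high' or 'low' when the interval excludes 1, else 'not significant'."""
    _, lower, upper = smr_interval(observed, expected)
    if lower > 1:
        return "high"
    if upper < 1:
        return "low"
    return "not significant"

print(classify(130, 100))  # large zone, clear excess
print(classify(3, 2))      # tiny zone, same direction but too few deaths
```

A small commune with 3 deaths against 2 expected has a ratio of 1.5 but an interval far too wide to conclude anything, which is why such places end up greyed out.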
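Controlling for age and gender can be done by indirect standardization: apply national death rates per age and gender stratum to the local population to get the number of deaths one would expect if the place behaved like the country as a whole. The sketch below assumes this approach. The strata, populations and rates are made-up illustration values.

```python
def expected_deaths(local_population, national_rates):
    """Expected deaths if each local stratum died at the national rate."""
    return sum(pop * national_rates[stratum]
               for stratum, pop in local_population.items())

# Hypothetical strata keyed by (sex, age band); numbers are illustrative.
national_rates = {("F", "60-69"): 0.005, ("M", "60-69"): 0.010}
local_population = {("F", "60-69"): 1000, ("M", "60-69"): 800}

expected = expected_deaths(local_population, national_rates)
observed = 16
ratio = observed / expected  # above 1 means more deaths than expected
```

Because the expected count already reflects how old the local population is, a ratio different from 1 points to factors other than age and gender.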

> Why is mortality low here and high there?

3. Does it correspond to a certain specific geographic environment?

Here, we tried to gather maps of geographic risk factors that might explain low or high mortality. WARNING: the following maps were found in various places online and we did not fact-check them. For each of them we give a link to where we found it. A further step is to gather precise data, but we thought that this visual analysis could help convey the big picture of what we are trying to do, what difficulties we will face, etc.

Air pollution (we have many more graphs here and in further sections, but they make this web page slow)
Water pollution
Access to healthcare
Light pollution?
Behaviors
Wealth
Other?
Major pollutions

4. Is it linked to a particularly low/high mortality for some specific diseases? notably cancer?

Cancer
While we did not find cancer data at a finer granularity than departments... we did find images at a finer granularity than departments! Namely, at "canton" and "zone d'emploi" levels. Some of them cover France or France Metropolitaine, others specific regions. From the images, we can extract the corresponding data!
Other diseases
The CepiDC website provides mortality data per disease at department, age, gender and year granularity. This is not a very fine geographic granularity, but for diseases other than cancer it is the best data source we have found so far.
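Extracting data from the cancer map images mentioned above amounts to matching each pixel's color to the closest color in the map legend. Here is a minimal sketch of that idea. The legend colors and class labels are hypothetical, and reading pixels from an actual image file (for example with an imaging library) as well as locating each canton in the image are not shown.

```python
from collections import Counter

def nearest_legend_class(pixel, legend):
    """Class of the legend color closest to `pixel` (squared RGB distance)."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(legend, key=lambda color: squared_distance(pixel, color))
    return legend[best]

def zone_class(pixels, legend):
    """Majority class among the pixels sampled inside one zone."""
    votes = Counter(nearest_legend_class(p, legend) for p in pixels)
    return votes.most_common(1)[0][0]

# Hypothetical legend: RGB color -> mortality class.
legend = {
    (200, 30, 30): "high",
    (240, 240, 200): "medium",
    (30, 150, 60): "low",
}
sampled = [(205, 35, 28), (198, 33, 31), (235, 238, 205)]
print(zone_class(sampled, legend))
```

The majority vote makes the result robust to border pixels and anti-aliasing, at the cost of recovering only the legend class and not the underlying rate.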

5. Special analyses

Maps and our eyes are very powerful for putting us on the right track.

Models are needed for more accuracy: Mortality = f(geographical factors)

Example: A graph BreastCancerMortality = f(AllCauseMortality-BreastCancerMortality) at department level (this is to be done for each body location, breast is an example) shows that:

<<graph to create! as part of the project; BreastCancerMortality = f(AllCauseMortality), at fine granularity>>

the two are roughly aligned. This indicates a strong common group of risk factors between breast cancer and most other major health conditions.
for some cancer types, some specific departments are not aligned. Why? Are there specific factors for these specific cancers that are stronger in these departments? This can be studied and interpreted on a case-by-case basis.
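Spotting the departments that are "not aligned" can be automated by fitting a straight line and flagging large residuals. The sketch below assumes an ordinary least-squares fit and a threshold of two residual standard deviations. Both choices, the department labels and the numbers are illustrative, not project results.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_outliers(names, xs, ys, threshold=2.0):
    """Names whose residual exceeds `threshold` residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = math.sqrt(sum(r ** 2 for r in residuals) / (len(xs) - 2))
    return [name for name, r in zip(names, residuals) if abs(r) > threshold * sd]

# Synthetic example: x = other-cause mortality, y = breast cancer mortality.
names = [f"dept_{i:02d}" for i in range(1, 11)]
other_causes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
breast_cancer = [2, 4, 6, 8, 19, 12, 14, 16, 18, 20]

print(flag_outliers(names, other_causes, breast_cancer))
```

Each flagged department is then a candidate for the case-by-case interpretation described above.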

Appendix

Answer 1, based on general knowledge. Cancer is mostly an ageing-related disease, and many risk factors for cancer are shared with the other pathologies that cause death in our society. This is true for prevention too: it is now general knowledge that smoking, obesity and pollution are bad, and that regular physical activity and eating well are good, for cancer as well as for cardiovascular and brain health. So, when searching for geographical factors associated with all-cause mortality, it is expected that the risk factors found will to a large extent make sense for cancer, and for health in general.

Age is the single most important factor for cancer risks, and yet the curves above do not show age! The reason is that we used age-standardized rates based on the knowledge of the precise age distribution. We can probably (this is exploratory at this stage, via another technical approach) similarly try to rectify departmental cancer rates based on the knowledge of the precise distribution of socioeconomic status in the department (and of its precise impact on mortality rates at national level), in order to "remove" socioeconomic risk factors and better see complementary risk factors like radon.

French all-cause mortality tables are available based on socio-economic status, wealth, occupation (Insee statistics at France level) [perhaps also obesity and smoking status]. See how it affects mortality (done in Excel):

Based on the socioeconomic characteristics of the local population, it is possible to build theoretical mortality rates by mixing these tables, and to express them as a ratio to the general population mortality: this ratio is expected to be lower than one for Paris, for example.

If we assume that, to first order, socioeconomic factors act similarly on cancer mortality and all-cause mortality (for cancers with mass screening this might not hold, as people of higher socioeconomic status tend to get screened more often), one could correct cancer mortality rates for the specific type of population at department level by dividing by the ratio. Then, it should be possible to do CorrectedBreastCancerMortality = f(radon) at department level without so much interference from first-order risk factors.
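The correction described above can be sketched in a few lines. The sketch assumes mortality rates per socioeconomic group are known at national level and that only the share of each group differs locally. Group names, rates and shares are made up, and age strata are left out for brevity (in practice the sums would run over age and group together).

```python
def theoretical_ratio(local_shares, national_shares, group_rates):
    """Local theoretical mortality divided by national mortality."""
    local = sum(share * group_rates[g] for g, share in local_shares.items())
    national = sum(share * group_rates[g] for g, share in national_shares.items())
    return local / national

def corrected_rate(observed_rate, ratio):
    """Observed rate with the socioeconomic composition effect divided out."""
    return observed_rate / ratio

# Hypothetical national mortality rates per socioeconomic group.
group_rates = {"executive": 0.006, "worker": 0.012}
national_shares = {"executive": 0.5, "worker": 0.5}
local_shares = {"executive": 0.8, "worker": 0.2}  # a wealthier department

ratio = theoretical_ratio(local_shares, national_shares, group_rates)
corrected = corrected_rate(20.0, ratio)  # e.g. deaths per 100,000
```

Here the wealthier department gets a ratio below one, so its observed cancer mortality is revised upward before being compared with radon levels.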

The validity of ratios and their use could be tested with all-cause mortality, at fine granularity as well as department level.

What we need to do

Create the missing graphs at department level. This requires scraping mortality by department and creating the graphs (R or Python is welcome).
Create the missing graph at fine level. This requires being at ease with the Insee individual deaths report. Anyone? (On data.gouv.fr I found an R program to clean the data and adjusted it to make it work, but I haven't investigated the data yet.)
Explore what we have at hand, prioritising carcinogenic agents in water and air to synergize with the NEOS Epidemium project (concrete description on googledrive>ODE>"Neos, come and help!"). Also, as we can see above in the last graph, age has an ENORMOUS impact on cancer risk, even compared to socio-economic factors. Specific analyses may be conducted on the matter.
Get more fine-level data (I started to gather a list based on personal googling and exchanges with various people at Epidemium), prioritising the risk factors of the NEOS project.
Explore further.
Find someone to facilitate, guide people and build a community (I would be happy if others could help me with this, as it takes time). This could become a worldwide community, and it doesn't matter whether it grows before or after mid-January.

Additional information

Short Name: #OpenData4Health
Created on: November 11, 2021
Last update: December 12, 2021
Looking for collaborators: ✅

Keywords

machine learning

Web scraping

Data engineering

Associated SDGs

Good Health and Well-being