Data mining South Carolina data suggests environmental role in PD

Posted to NeuroTalk 23rd July 2012
http://neurotalk.psychcentral.com/thread173642.html
--------------------------------------------------------------------------------

Summary

Data mining prevalence data from South Carolina suggests that environmental factors play a part in the prevalence of PD and its progression.

Questions addressed in this post

Using publicly available Parkinson's Disease epidemiological data, can we throw some light on the following questions:

Do environmental factors play a part in the prevalence of PD?
Do environmental factors play a part in the progression of PD after diagnosis?

Scale

The analysis is based on one state and on one administrative division, the county. This limits the issues that can be detected. The granularity is too coarse to detect, for instance, sick building syndrome, and the extent is probably too small to detect the impact of large scale features, such as UV levels. In short, this analysis is blind to the impact of many factors. However, the US county is an appropriate size, both in terms of area and population, to capture many environmental features, such as large scale pollution sources, like motorways.

There is no doubt that observed PD prevalence rates vary from county to county. The issue is to separate the significant variations from the random fluctuations, the noise.

Prevalence

How can we assess whether there is an association between the environment and PD? If we have county environmental data, such as ozone levels, we could find the correlation between this and the county prevalence rates. But, even if we don't have environmental data, we can still make progress: we could search for correlations between counties, expecting a higher correlation between neighbouring counties, because they share more of their environment.
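As a minimal sketch of the first approach, here is how a correlation between a county environmental measure and county prevalence rates could be computed. The ozone and prevalence figures are made up purely for illustration (they are not South Carolina data), and the function and variable names are my own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# HYPOTHETICAL values for five imaginary counties, for illustration only.
ozone_ppb = [38.0, 41.5, 44.0, 47.2, 52.1]
prevalence_per_100k = [210.0, 198.0, 240.0, 251.0, 262.0]

r = pearson_r(ozone_ppb, prevalence_per_100k)
print(f"correlation between ozone and prevalence: r = {r:.2f}")
```

With real data, each list would hold one value per county, in the same county order.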
An alternative approach is, for each county, to randomly divide people into two groups, and report the prevalence rate for each group. Then, if environmental factors play no part, we'd expect the correlation between the two groups' rates, taken across all counties, to be 0. Note that this approach does not require environmental data to test against. Its downside is that, while it can tell that non-random fluctuations are occurring, it does not identify what the possible cause is.

Unfortunately, we can't do this statistical experiment because we don't have access to the raw data. However, we can get a proxy for it by using the data segmented by group.

Progression

Unfortunately, I don't have access to progression rate data. To make progress we need to derive a proxy measure.

Everything else being equal, to the extent that Parkinson's affects mortality rates, a faster progression will lead to an earlier death. For something without a cure, the time between diagnosis and death is the duration of the disease. Unfortunately, I don't have access to duration data either. We need to look for a proxy.

We note that the duration equals the prevalence divided by the incidence. We don't have access to incidence data. We need to look for a proxy.

Finally, using a "what comes in must go out" argument, we note that if the prevalence of the disease is not changing, the incidence equals the mortality rate. Mortality rate data is available. The problem is that it under-reports PD, leading to implied durations that, in some cases, exceed 40 years. Fortunately, for our purposes here, the absolute duration is not important. What matters is the relative duration. For this to be useful, consistency in reporting is required.

We can continue by using the two group method described above.

Weaknesses in the analysis

1. There may be different reporting rates. (This implies the real correlation between PD and environmental factors is lower.)
2. There may be family genetic clusters not smoothed out at the county level. (lower)
3. The scale of the environmental factors may be higher or lower than that captured at the county level. (higher)
4. The two groups may not be evenly mixed across the county. (higher)
5. Long duration can occur for good reasons, e.g. better treatment, or bad reasons, e.g. early onset. (lower)
6. Sample sizes are low.
7. Some counties may provide better health care. (lower)
8. Data inconsistencies. Parkinson's Disease is an under-reported cause of death. The CDC Wonder tool suppresses results where there are fewer than 10 data points. Therefore, to get enough results, I used all 11 years' worth of data. So, the duration estimates are based on different years than the prevalence results. (higher)

Data Sources

Prevalence data has come from a report by Forti et al. [1]. They used UB-92 billing data for the period 1996-2000 for "a diagnosis of ICD-9 code 332.0 (paralysis agitans or idiopathic Parkinsons disease) or 332.1 (secondary parkinsonism)". They give results for both "whites" and "African-Americans".

Mortality rates have been obtained using the CDC Wonder database front end on the "Multiple Cause of Death, 1999-2009" data set [2], searching on ICD-10 code G20 (Parkinson's disease). The data used was for "White" and, to approximate "Black or African American", for everyone not recorded as white.

Results

Prevalence. Based on the 46 South Carolina counties, the correlation between the two groups is 0.36 (n=46, CI=[0.07, 0.58]). Squaring the correlation indicates that approximately 10% of the variance in the prevalence is related to county level differences.

Duration. Based on the 42 South Carolina counties for which the data is available (to maintain confidentiality the CDC suppresses county results where there are fewer than 10 reports), the correlation between the two groups is 0.52 (n=42, CI=[0.26, 0.71]), which indicates that approximately 25% of the variance in the duration is related to county level differences.
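For anyone wanting to check the arithmetic, here is a rough sketch of two of the calculations above: the duration proxy (prevalence divided by mortality rate, standing in for incidence) and a confidence interval for a correlation. The post does not say how the intervals were obtained, so the Fisher z-transformation used here is an assumption, as are the function names and the example rates. Under that assumption the duration interval is reproduced closely and the prevalence interval to within about 0.01. The squared correlations come out at about 0.13 and 0.27, which the text rounds to approximately 10% and 25%.

```python
from math import atanh, sqrt, tanh


def implied_duration_years(prevalence_per_100k, deaths_per_100k_per_year):
    """Duration proxy: prevalence / incidence, with mortality standing in for incidence.

    Only valid as a relative measure, and only if prevalence is roughly stable.
    """
    if deaths_per_100k_per_year <= 0:
        raise ValueError("mortality rate must be positive")
    return prevalence_per_100k / deaths_per_100k_per_year


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation.

    ASSUMPTION: uses the Fisher z-transformation; the original analysis
    may have used a different method.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = atanh(r)                       # transform r to the z scale
    half_width = z_crit / sqrt(n - 3)  # standard error of z is 1/sqrt(n-3)
    return tanh(z - half_width), tanh(z + half_width)


# HYPOTHETICAL rates, chosen only to show how under-reported deaths
# inflate the implied duration.
example_duration = implied_duration_years(300.0, 6.0)
print(f"implied duration: {example_duration:.0f} years")

# Correlations and county counts as quoted in the Results.
for label, r, n in [("prevalence", 0.36, 46), ("duration", 0.52, 42)]:
    low, high = fisher_ci(r, n)
    print(f"{label}: r = {r}, r^2 = {r * r:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```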
Conclusions

The data from South Carolina suggests that environmental factors at the county scale play a part in the prevalence and progression of PD. More work is required to confirm these results.

I will be grateful for your comments. I would like to hear from anyone with local knowledge who could suggest specific reasons for the differences.

References

[1] Forti E., Bergmann K., Salak V., Wall K., Fleming T., "Parkinson's Outreach and Education Training (POET) Planning Grant, Final Report", South Carolina Geriatric Education Centre, Nov 15, 2003. http://coa.kumc.edu/gecresource/samp...sonsReport.pdf
[2] CDC Wonder, "Multiple Cause of Death, 1999-2009". http://wonder.cdc.gov/mcd-icd10.html

John