Statistics Archives - KDD Analytics

If odds are not odd, what about odds ratios?

KDD — Mon, 28 Jun 2021 00:37:12 +0000

What are the odds of developing a brain tumor from long-term use of cell phones?

This is an evolving area of research. Some studies have found an association and others have not.

But two recent meta-analyses suggest that the odds are about 33 to 44% greater due to long-term cell phone usage.

Got your attention?

“But what does this do to my odds of developing a brain tumor?” you may ask.

Before we answer that, we need to explain how the meta-analyses derive this 33 to 44% figure. Which introduces us to odds ratios.

Case-control studies

Studies of the association between cell phone usage and brain tumor are typically case-control studies.

Such studies are retrospective, as opposed to prospective.[1] They combine a sample of patients (cases) already diagnosed with a brain tumor with a random sample of non-patients (controls) drawn from the general population. Study investigators match controls to each case based on key demographics such as sex, age, and region.

The studies then measure and test for the existence of an association between exposure (cell phone usage) and outcome (brain tumor).[2]

Typically, these case-control studies report their estimated effects, not in terms of odds, but in terms of odds ratios.

So, what is an odds ratio?

Odds ratios

An odds ratio is a measure of association strength. In this case, between cell phone usage and the diagnosis of a brain tumor.

As an example, we can use the results from one of the high-quality studies used in the meta-analyses mentioned above to show how odds ratios are calculated.[3]

The data shown in the following table are from a case-control study conducted in Sweden between 2000 and 2003.[4] The data are for long-term cell phone usage (>= 10 years). The reference category is no cell phone usage.[5]

In an earlier article we learned that the odds of an event occurring are the number of events divided by the number of non-events.

Thus, the odds of a long-term cell phone user in this sample being diagnosed with a brain tumor are (16 / 232) or 0.069; about 1 to 14.

The odds of a non-cell phone user being diagnosed with a brain tumor are (18 / 674) or 0.027; about 1 to 37.

The odds ratio is simply the ratio of the two odds: (0.069 / 0.027) or 2.582 (calculated from the unrounded odds).

So, the odds of a long-term cell phone user being diagnosed with a brain tumor are 2.582 times those of a non-cell phone user.

Alternatively, this can be stated in terms of a % difference. The odds of a long-term cell phone user being diagnosed with a brain tumor are 158% greater compared to a non-cell phone user ((2.582 – 1) * 100).

That is a pretty large effect.[6]
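The arithmetic above can be reproduced with a short Python sketch. The counts (16 and 232 for long-term users, 18 and 674 for non-users) are the ones quoted in the text; the function and variable names are illustrative only.

```python
def odds(events: int, non_events: int) -> float:
    """Odds = number of events divided by number of non-events."""
    return events / non_events


# Counts quoted in the text for the Swedish case-control study.
odds_long_term_users = odds(16, 232)  # about 0.069
odds_non_users = odds(18, 674)        # about 0.027

# The odds ratio is the ratio of the two (unrounded) odds.
odds_ratio = odds_long_term_users / odds_non_users

# The same result expressed as a percentage difference.
pct_greater = (odds_ratio - 1) * 100

print(f"odds, long-term users: {odds_long_term_users:.3f}")
print(f"odds, non-users:       {odds_non_users:.3f}")
print(f"odds ratio:            {odds_ratio:.3f}")
print(f"percent greater:       {pct_greater:.0f}%")
```

Note that dividing the rounded odds (0.069 / 0.027) gives roughly 2.556; the 2.582 figure comes from carrying the unrounded odds through the division.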

Meta-studies

Now this is just one study. The two meta-studies alluded to above each combined the results of 7 different, high-quality studies.

They found that the overall odds (across the studies) of a long-term cell phone user (>= 10 years) being diagnosed with a brain tumor (any tumor type) are 33% and (with respect to glioma, a common type of tumor) 44% greater compared to a non-cell phone user.[7]

These meta-studies found no effect due to cell phone usage over a shorter period (i.e., < 10 years).

So, it appears that the risk, if it exists, is associated with long-term usage. Moreover, using a cell phone on the same side of the head is associated with 46% greater odds of developing a glioma on that side of the head.[8]

Odds of developing a brain tumor

So, back to our original question. What are the odds of developing a brain tumor from long-term cell phone usage?

The odds of developing a brain tumor among the general population are very low to start with. Annual incidence in the US (2018) is 6.5 per 100,000 or 0.0065%. In terms of odds, this is about 1 to 15,000.

So, a 44% increase in the odds would mean 9.4 per 100,000 or about 1 to 10,000. Still quite low.[9]
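As a rough check on these figures, the sketch below converts the baseline incidence to odds, raises the odds by 44%, and converts back. It assumes the annual incidence rate can be treated as a probability, which is a simplification made here for illustration; the function names are hypothetical.

```python
def odds_from_probability(p: float) -> float:
    """o = p / (1 - p)."""
    return p / (1 - p)


def probability_from_odds(o: float) -> float:
    """p = o / (1 + o)."""
    return o / (1 + o)


# Assumption: annual incidence of 6.5 per 100,000 treated as a probability.
baseline_p = 6.5 / 100_000
baseline_odds = odds_from_probability(baseline_p)

# Apply the 44% increase on the odds scale, then convert back.
raised_odds = baseline_odds * 1.44
raised_p = probability_from_odds(raised_odds)
raised_per_100k = raised_p * 100_000

# Express each odds value as "1 to N".
one_to_n_baseline = 1 / baseline_odds
one_to_n_raised = 1 / raised_odds

print(f"baseline odds: about 1 to {one_to_n_baseline:,.0f}")
print(f"raised rate:   {raised_per_100k:.1f} per 100,000")
print(f"raised odds:   about 1 to {one_to_n_raised:,.0f}")
```

For events this rare, odds and probability are nearly identical, so multiplying the rate itself by 1.44 gives almost the same answer.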

As one researcher put it, “Your chance of being hurt by distracted driving because you’re using your cell phone wipes out the risk of getting cancer.”

However, in 2011 the World Health Organization’s International Agency for Research on Cancer (IARC) did classify cell phones as a Group 2B carcinogen (i.e., possibly causes cancer).

And there continues to be a healthy debate in both the statistical and public arenas.

Studies continue to be released that purportedly find evidence that recent increases in the rate of glioblastomas, an aggressive type of cancer, are tied to cell phone usage.

Skeptics argue that changes in WHO classification of what is considered a glioblastoma may be responsible for any uptick in brain tumor incidence. And that the large increase in risk reported by studies, like the meta-studies discussed above, is inconsistent with the historical trend in brain tumor incidence.[10]

As we said at the outset, this is an evolving area of research, with lots of issues to untangle.

One thing to keep in mind, though, is who is funding the research. A topic we will cover in a later article.

We have odds ratios to thank

Back to the main point of this article.

Odds facilitate the measurement of the relative likelihood of events. Retrospective epidemiological studies commonly use the odds ratio as this relative measure of association strength.

So, the next time you hear that your favorite dietary choice increases your chances of developing cancer, it is probably the result of that not-so-oddity, the odds ratio.

[1] Prospective cohort studies have also been used (i.e., studies which track subjects over time). See here for a summary of the advantages and disadvantages of retrospective and prospective studies.

[2] Exposure is determined by answers to a lengthy questionnaire. Hence, one of the criticisms levied against case-control studies is respondent recall bias. That is, whether respondents accurately recall their cell phone usage, particularly over a long period of time.

[3] Studies are graded on a quality scale considering such factors as selection of cases and controls, comparability of cases and controls based on study design, and proper assessment/measurement of exposure.

[4] The results shown in the table are taken from a meta-study which considered this Hardell et al (2006) study.

[5] As cell phone usage becomes more ubiquitous, and fewer people who have never used a cell phone are available in the population, the exposure will need to be increasingly measured in terms of levels/frequency of usage.

[6] The additional risk derived using an odds ratio is closely related to the concept of efficacy, which is derived directly from the concept of relative risk (ratio of probabilities). We covered efficacy in an earlier article. Epidemiologists typically use relative risk to measure association strength in prospective (cohort) studies; odds ratios in case-control studies.

[7] Meta-studies start with a larger number of studies. They then cull studies from the final sample for various reasons, such as data availability and the quality grade they receive.

[8] All these studies on brain tumors controlled for whether cell phones were being used next to users’ heads.

[9] See for US cancer incidence rates as of 2018.

[10] See also Geoffrey Kabat (2017).

The post If odds are not odd, what about odds ratios? appeared first on KDD Analytics.

Odds and probability…two sides of the same coin

KDD — Fri, 04 Jun 2021 16:59:03 +0000

What are the lifetime odds of dying from being hit by a meteorite?

1 in 1,600,000.

Yep, not very likely. You are much more likely to die from a dog attack (1 in 86,781) or from a lightning strike (1 in 138,849).

But why odds?

Why not express these likelihoods in terms of probabilities? Seems like a more natural way to express the chance of an event occurring, doesn’t it?

Odds, however, are commonly used to express event risk. And of course, the chances of winning a sporting event.

As we write, the odds of the Los Angeles Dodgers repeating as World Series champions in 2021 are 1 to 3.[1]

So, what are odds?

The number of times an event occurs divided by the number of times it does not occur.

In the case of a meteorite strike, for every person that dies, 1.6 million do not. In the case of the Dodgers, we would expect the Dodgers to win 1 World Series for every 3 they lose.

But this still raises the question, why odds and not probabilities?

Odds and probabilities

It is true that the probability of a low likelihood event is so small that stating it as a % requires a lot of zeros after the decimal (0.0000625% in the case of dying from a meteorite strike).

But that is not an insurmountable objection. For example, the risk of disease is often expressed in terms of rates per 100,000 to make the chances of low likelihood events easier to comprehend. Or we could state the probability of the non-event…not dying from a meteorite strike (99.99994%).

A more important reason for using odds is that they facilitate multiplicative comparisons.

A simple example makes clear how probabilities can fall short.

Suppose the probability that Beth will go out to dinner this weekend is 75%. We cannot say, then, that the probability of Jose doing the same is 3 times that of Beth’s probability.

Why? Probabilities are constrained to lie between 0 and 1. And 3 * 0.75 > 1.0.

So, what do we do? Enter odds.

Odds are unconstrained

Odds are only bounded on the low end, by 0. Let’s return to Beth and Jose.

The odds of Beth going out to dinner are 3 or 3/1. Why 3/1?

Remember odds are the ratio of the events to non-events. Beth is 75% likely to go out. So, if she is faced with 4 opportunities to go out, she will do so 3 times. In other words, she will go out (event) 3 times for every time she stays home (non-event). 3 to 1.

Now, if Jose is 3 times as likely to go out as Beth, his odds are simply 3 * 3 or 9. Equivalently, we can express his odds as 9 to 1 or 9/1.

On the odds scale, odds can be 2, 10, 50 times greater…there is no upper limit. And this makes them very useful when we wish to compare the relative likelihood of events occurring.

Two sides of the same coin

It turns out that if we are still interested in the probability, we can easily derive it from the odds. Odds and probability are two sides of the same coin.

Odds (o) are related to probability (p) by the following:

o = p / (1 – p) = (probability of event / probability of non-event)

Rearranging we find the “other side of the coin” (for an event):

p = o / (1 + o) = (odds of event) / (1 + odds of event)

So, in the case of Beth and Jose we get: for Beth, p = 3 / (1 + 3) = 0.75; for Jose, p = 9 / (1 + 9) = 0.90.

The relationship between odds and probability is shown graphically below.

As the odds increase, the probability also increases but in a non-linear manner. As shown above, the probability “increases at a decreasing rate” and approaches 1.0 “asymptotically” (i.e., as the odds get very large, the probability approaches but never quite reaches 1.0).[2]

But any finite odds will map to a probability between 0 and 1.
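The two conversion formulas can be sketched in a few lines of Python. The function names are illustrative; the Beth and Jose figures are the ones used earlier in the article.

```python
def odds_from_probability(p: float) -> float:
    """o = p / (1 - p)."""
    return p / (1 - p)


def probability_from_odds(o: float) -> float:
    """p = o / (1 + o)."""
    return o / (1 + o)


# Beth goes out with probability 0.75, so her odds are 3 (3 to 1).
beth_odds = odds_from_probability(0.75)

# Jose's odds are 3 times Beth's odds.
jose_odds = 3 * beth_odds
jose_p = probability_from_odds(jose_odds)

print(f"Beth: odds {beth_odds:.0f}, probability 0.75")
print(f"Jose: odds {jose_odds:.0f}, probability {jose_p:.2f}")

# As the odds grow, the probability approaches but never reaches 1.0.
for o in (0.5, 1, 3, 9, 99, 999):
    print(f"odds {o:>5} -> probability {probability_from_odds(o):.4f}")
```

Tripling Beth's odds moves the probability from 0.75 to 0.90, not to an impossible 2.25, which is the point of working on the odds scale.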

Odds are preferred

When comparing the relative chances of events (or sports teams), odds are the preferred way of expressing how much more likely one event is over another. We can always derive the associated probability. But since odds are unconstrained, there is no issue with saying the Los Angeles Dodgers are 11 times as likely (as of May 31, 2021) to win the World Series in 2021 as the Chicago Cubs.

So, the next time someone tells you the odds of rain during your camping trip this weekend are 5 to 2, you might want to sleep in a tent.

[1] As of May 31, 2021, the reported odds of the Dodgers repeating are 3 to 1 or 3/1. In the betting world, this is referred to as fractional odds. The number on the left or numerator is typically the number of times the team is expected to lose. 3/1 yields an implied probability of losing 3 times out of 4 or 75%. Thus, the probability of the Dodgers repeating is (1 – 0.750) or 25%. Expressing as odds yields 1/3.

[2] In the limit, if the odds = infinity, then probability = 1.

Curse of Big Data

KDD — Mon, 03 May 2021 11:31:59 +0000

“Big data.”

We checked in with Google search trends recently. Appears that “Big Data” has lost its luster search-wise…started trending down about 4 years ago.

Nowadays, everything is big data?

Implications of big data

However, this does not mean we should lose sight of certain statistical implications associated with being “big”. Yes, large amounts of data can help us estimate relationships (effects) with a high degree of precision.

And help us uncover low occurrence events such as the blood clotting cases associated with the Johnson & Johnson COVID-19 vaccine.

But massive amounts of data can reveal patterns that are not always meaningful or happen by chance.

Additionally, from a statistical inference perspective, with big data, even small, uninteresting effects can be statistically significant.

This has important implications for inferential conclusions about the associations we are studying.

And it does not take all that much data for this to happen.

Small clinical trial example

As an example, consider the following hypothetical results from a clinical trial of a “common” cold vaccine:

The table shows the number of subjects who had a positive outcome (no infection) or a negative outcome (infection) under each of the two types of treatment. A standard statistical test of association, the Pearson chi-squared, indicates we cannot say there is any difference in outcomes across the two treatment types.

That is, we cannot reject the “null” hypothesis of no association at the 95% level of confidence (i.e., X² = 0.024, well below the critical value of 3.841 for one degree of freedom).

The strength of the association, or effect size, is obtained from the relative risk, which is the ratio of the two risks.

The probability of a vaccinated subject getting sick is (24 / 59) or 0.407 (40.7%) while that for the placebo group is (29 / 69) or 0.420 (42.0%).

So the relative risk ratio is (0.407 / 0.420) or 0.968.[1]

Thus, we would expect that when applied to the population, under the same conditions as the study, there would be 3.2% fewer infections among those who received the vaccine (i.e., (1 – 0.968) * 100).

This 3.2% is known as the efficacy rate of the vaccine.

The 95% confidence interval for the relative risk ratio is wide (i.e., 0.639 to 1.465) indicating a lack of precision in the point estimate of 0.968.
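The small-trial figures can be reproduced from the counts quoted in the text (24 of 59 vaccinated subjects infected, 29 of 69 placebo subjects infected). The sketch below is one way to do so, not necessarily the authors' method: it assumes a Pearson chi-squared without continuity correction and a standard log-scale interval for the relative risk, both of which happen to match the reported numbers. Function names are illustrative.

```python
import math


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def relative_risk_with_ci(a: int, n1: int, c: int, n2: int, z: float = 1.96):
    """Relative risk (a/n1)/(c/n2) with a log-scale confidence interval."""
    rr = (a / n1) / (c / n2)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Counts quoted in the text.
vaccine_infected, vaccine_total = 24, 59
placebo_infected, placebo_total = 29, 69

chi2 = chi_squared_2x2(
    vaccine_infected, vaccine_total - vaccine_infected,
    placebo_infected, placebo_total - placebo_infected,
)
rr, ci_low, ci_high = relative_risk_with_ci(
    vaccine_infected, vaccine_total, placebo_infected, placebo_total
)
efficacy_pct = (1 - rr) * 100

print(f"chi-squared:   {chi2:.3f}")
print(f"relative risk: {rr:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
print(f"efficacy:      {efficacy_pct:.1f}%")
```

Because the interval comfortably spans 1.0, the data are consistent with the vaccine having no effect at all.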

The study investigators conclude that the effect of the vaccine is neither statistically nor practically significant.

Aside from its statistical insignificance, an efficacy rate of just 3.2% is not nearly large enough to justify starting production of the vaccine.

Large clinical trial example

Contrast this with the following study results based on a much larger sample of 44,800 subjects:[2]

The Pearson chi-squared statistic (X²) is now 8.375. Thus, the hypothesis of no association can be rejected at the 95% level of confidence.

And the 95% confidence interval for the relative risk ratio is much narrower indicating a much higher level of precision (i.e., 0.947 to 0.990).[3]

The study investigators now conclude that there is a statistically significant association between receiving the vaccine and avoiding a cold infection (positive outcome).

But the relative risk ratio of infection for vaccinated subjects is identical to that obtained from the smaller study, 0.968.

Implying the efficacy rate is also the same, 3.2%.
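The effect of sample size alone can be illustrated by scaling the small-trial counts. The sketch below assumes the larger trial is simply the small table multiplied by 350 (44,800 / 128); that is an assumption made here, although it does reproduce the reported chi-squared statistic and confidence interval. The uncorrected chi-squared, the log-scale interval, and the names used are likewise illustrative choices.

```python
import math

CRITICAL_VALUE = 3.841  # chi-squared, 1 degree of freedom, 5% level


def summarize(scale: int):
    """Scale the small-trial counts and return (n, chi2, rr, ci_low, ci_high)."""
    a, n1 = 24 * scale, 59 * scale   # vaccine: infected, total
    c, n2 = 29 * scale, 69 * scale   # placebo: infected, total
    b, d = n1 - a, n2 - c            # not infected in each group
    n = n1 + n2

    chi2 = n * (a * d - b * c) ** 2 / (n1 * n2 * (a + c) * (b + d))
    rr = (a / n1) / (c / n2)
    se_log = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    ci_low = math.exp(math.log(rr) - 1.96 * se_log)
    ci_high = math.exp(math.log(rr) + 1.96 * se_log)
    return n, chi2, rr, ci_low, ci_high


for scale in (1, 10, 100, 350):
    n, chi2, rr, ci_low, ci_high = summarize(scale)
    verdict = "reject" if chi2 > CRITICAL_VALUE else "cannot reject"
    print(
        f"n={n:>6}  chi2={chi2:6.3f}  RR={rr:.3f}  "
        f"CI=({ci_low:.3f}, {ci_high:.3f})  {verdict} no association"
    )
```

The relative risk never moves; only the chi-squared statistic and the width of the interval change as the sample grows.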

Practical vs statistical significance

What are we to make of this?

From the perspective of effect size, do the larger study results carry more weight simply because the hypothesis of no association can be rejected? Even though the practical significance has remained the same?

We can turn a very small, 3.2% effect into a statistically significant effect by simply increasing the sample size.

But does this change the practical significance of the 3.2%?

No.

If 3.2% was deemed by the study investigators to be practically insignificant, it remains practically insignificant. Despite the larger sample size and despite it now being statistically significant.[4]

A curse of data “bigness”

With a large enough sample, almost any effect is statistically significant, even associations that are not practically significant or very interesting.

The implication is that rather than focusing on hypothesis testing as sample sizes increase, the focus should shift. Towards the size of the estimated effect, whether the estimated effect is “practically” important, and “sensitivity analysis” (i.e., how does the estimated effect change when control variables are added and dropped).[5]

Confidence intervals can and should play a role. But they will get narrower and narrower as sample sizes grow. And everything within the confidence interval could still be deemed not practically important.

In sum, as data get bigger (and it does not take massive amounts of data for this to be an issue), we need to guard against concluding that a small effect is practically significant just because the p-value is very small (i.e., the effect is statistically significant).

The curse of big data is still very much with us.

[1] A ratio of 1.0 would mean no difference in effect between the treatment types.

[2] As a point of comparison, the 2020 Moderna and Pfizer COVID-19 vaccine trials consisted of about 30,000 and 40,000 subjects.

[3] A more complicated technique is used to calculate confidence intervals for actual clinical trial results than used here, which typically results in wider intervals. For example, in 2020, Moderna reported an efficacy rate of 94.1% for its COVID-19 vaccine with a 95% confidence interval of 89.3% to 96.8%.

[4] Since the standard error of the relative risk ratio estimate is based on the cell counts in the contingency table, increasing the size of the sample lowers the standard error, making it more likely we can reject the null hypothesis at a given level of confidence.

[5] The paper Too Big to Fail presents a nice discussion of these issues. Additionally, the American Statistical Association released recommendations on the reporting of p-values.

Efficacy vs Effectiveness of the COVID Vaccines…“tomato, tomahto”?

KDD — Thu, 08 Apr 2021 18:15:48 +0000

You like potato and I like potahto
You like tomato and I like tomahto
Potato, potahto, tomato, tomahto
Let’s call the whole thing off

But oh, if we call the whole thing off
Then we must part
And oh, if we ever part
then that might break my heart

—Ira Gershwin

The eye-popping efficacy rates reported for the Moderna (94%), Pfizer (95%) and, to a lesser extent, the Johnson & Johnson (66%) COVID-19 vaccines have undoubtedly not escaped your attention.

But what is vaccine efficacy and how is it calculated? And how does it differ from vaccine effectiveness?

Moderna vaccine efficacy

First, consider efficacy. Using Moderna’s reported clinical trial results as an example, we see that it is a straightforward calculation.

Moderna reported results from its COVID-19 vaccine trial in November 2020. The results are shown below in a 2×2 “contingency” or “cross-tabulation” table. The columns show the number of subjects who were infected (or not); the rows show the number who received the vaccine (or the placebo). And the cells show the intersection of those two events.

Relative risk

The strength of the association, or the effect size, between receiving the vaccine and not getting infected is measured by the relative risk.

The probability or risk of a vaccinated subject being infected is 0.08%. That is, (11 / 14,134) or the number of events / sum of events and non-events. For a subject receiving the placebo, the probability of infection is higher at 1.31% (i.e., 185 / 14,073).

So, using the placebo group as the reference group, the relative risk is (11 / 14,134) / (185 / 14,073) or 0.059.[1]

In other words, the risk of a vaccinated person being infected is 94.1% lower compared to a subject who received the placebo (i.e., (1 – 0.059) * 100).

It is this calculation of 94.1% that was reported by Moderna as the vaccine’s efficacy rate.[2]
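The efficacy calculation can be reproduced directly from the counts quoted above (11 infections among 14,134 vaccinated subjects and 185 among 14,073 placebo subjects). This is a minimal sketch of the point estimate only; the variable names are illustrative, and it does not attempt the confidence interval method used in the actual trial.

```python
# Counts quoted in the text for the Moderna trial.
vaccine_infected, vaccine_total = 11, 14_134
placebo_infected, placebo_total = 185, 14_073

# Risk = events / (events + non-events) within each group.
risk_vaccine = vaccine_infected / vaccine_total
risk_placebo = placebo_infected / placebo_total

# Relative risk uses the placebo group as the reference.
relative_risk = risk_vaccine / risk_placebo

# Efficacy is the percentage reduction in risk.
efficacy_pct = (1 - relative_risk) * 100

print(f"risk, vaccinated: {risk_vaccine:.2%}")
print(f"risk, placebo:    {risk_placebo:.2%}")
print(f"relative risk:    {relative_risk:.3f}")
print(f"efficacy:         {efficacy_pct:.1f}%")
```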

Vaccine effectiveness

So, what about vaccine effectiveness? The term effectiveness refers to how the vaccine performs in the real world. Efficacy refers to how the vaccine performs under “optimal” conditions of a clinical trial.

Clinical trials are based on a sample of subjects who may not be fully representative of the general population (e.g., not all comorbidities are controlled for). In addition, the COVID strain that existed in the population during the clinical trial period may not be the same as the one circulating when the vaccine is released. Also, vaccine transportation, storage and delivery may differ from the more controlled environment of the clinical trial. Thus, the effectiveness of the vaccine may be different from what was found during the clinical trial.

Studies on COVID vaccine effectiveness

So, do we have any data yet on the real-world effectiveness of the COVID vaccines? It takes time to collect data, but we do have some indication that vaccine effectiveness is very high.

An early study appeared February 24, 2021 in the New England Journal of Medicine. The study examined the Pfizer vaccine performance in Israel. The sample was matched data from over 1 million people, half of whom were vaccinated between December 2020 and February 2021 and half of whom were not. The results of the study suggest an effectiveness rate of 94% against symptomatic infection 7+ days after the second dose.

A more recent study released by the CDC on April 2 examined both the Pfizer and Moderna vaccines. This study used US data from December 2020 to March 2021. The sample consisted of 3,950 health care personnel, first responders, and other front-line workers. The study found that the vaccines were 90% effective against COVID infection 14+ days after the second dose. Even 14+ days after the first dose the vaccines were 80% effective.

As a point of comparison, according to the CDC, effectiveness of the annual flu vaccination ranges between 40 and 60%.[3]

So, the effectiveness rate, after 2 doses of the Pfizer and Moderna vaccines, appears to be very close in magnitude to the efficacy rate.

Very good news indeed!

Tomato, tomahto?

[1] A relative risk ratio of 1.0 would mean no difference in effect between the treatment types.

[2] A summary of efficacy rates across the range of current COVID vaccines can be found here.

[3] One reason for the range is that the flu strain that is in circulation can differ from what was predicted when the annual flu vaccine was developed earlier in the year.
