statistics Archives - KDD Analytics

Curse of Big Data

KDD — Mon, 03 May 2021 11:31:59 +0000

“Big data.”

We checked in with Google search trends recently. It appears that “Big Data” has lost its luster search-wise: the term started trending down about 4 years ago.

Nowadays, everything is big data?

Implications of big data

However, this does not mean we should lose sight of certain statistical implications associated with being “big”. Yes, large amounts of data can help us estimate relationships (effects) with a high degree of precision.

And help us uncover low-occurrence events, such as the blood clotting cases associated with the Johnson & Johnson COVID-19 vaccine.

But massive amounts of data can also reveal patterns that are not meaningful or that arise purely by chance.

Additionally, from a statistical inference perspective, with big data, even small, uninteresting effects can be statistically significant.

This has important implications for inferential conclusions about the associations we are studying.

And it does not take all that much data for this to happen.

Small clinical trial example

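As an example, consider the following hypothetical results from a clinical trial of a “common” cold vaccine (cell counts as implied by the risks quoted below):

            No infection    Infection    Total
Vaccine               35           24       59
Placebo               40           29       69
Total                 75           53      128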

The table shows the number of subjects with a positive outcome (no infection) or a negative outcome (infection) for each of the two treatment types. A standard statistical test of association, the Pearson chi-squared test, indicates that we cannot say there is any difference in outcomes across the two treatment types.

That is, we cannot reject the “null” hypothesis of no association at the 95% level of confidence: the test statistic, X² = 0.024, falls far below the critical value of 3.84 for a 2×2 table (one degree of freedom).
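To see where the X² value comes from, here is a minimal sketch of the Pearson chi-squared calculation for a 2×2 table. The cell counts are the ones implied by the infection counts given in this example (24 of 59 vaccinated subjects, 29 of 69 placebo subjects). The function names are illustrative, and leaving out the continuity correction is an assumption; it happens to reproduce the reported statistic.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: vaccine, placebo. Columns: no infection, infection.
chi2_small = pearson_chi2_2x2(35, 24, 40, 29)
p_small = p_value_1df(chi2_small)
print(f"X2 = {chi2_small:.3f}, p = {p_small:.2f}")
```

The p-value is far above 0.05, which is why the null hypothesis of no association is not rejected.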

The strength of the association, or effect size, is measured by the relative risk: the ratio of the risk of infection in the vaccine group to the risk of infection in the placebo group.

The probability of a vaccinated subject getting sick is (24 / 59) or 0.407 (40.7%) while that for the placebo group is (29 / 69) or 0.420 (42.0%).

So the relative risk ratio is (24 / 59) / (29 / 69) or 0.968.[1]

Thus, we would expect that when applied to the population, under the same conditions as the study, there would be 3.2% fewer infections among those who received the vaccine (i.e., (1 – 0.968) * 100).

This 3.2% is known as the efficacy rate of the vaccine.

The 95% confidence interval for the relative risk ratio is wide (i.e., 0.639 to 1.465) indicating a lack of precision in the point estimate of 0.968.
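The relative risk, efficacy rate and confidence interval can be sketched in a few lines. How the interval was computed is not stated; the sketch below assumes the common log-scale (Wald-type) approximation for a relative risk, which reproduces the quoted numbers. The function and variable names are illustrative.

```python
import math


def relative_risk_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Relative risk with an approximate 95% CI computed on the log scale.

    Assumption: log-scale (Wald-type) interval; other methods exist.
    """
    rr = (events_treated / n_treated) / (events_control / n_control)
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# 24 infections among 59 vaccinated, 29 infections among 69 placebo subjects.
rr, lower, upper = relative_risk_ci(24, 59, 29, 69)
efficacy_pct = (1 - rr) * 100
print(f"RR = {rr:.3f}, 95% CI = ({lower:.3f}, {upper:.3f}), "
      f"efficacy = {efficacy_pct:.1f}%")
```

Because the interval contains 1.0 (no difference between treatments), it tells the same story as the chi-squared test.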

The study investigators conclude that the effect of the vaccine is neither statistically nor practically significant.

Aside from its statistical insignificance, an efficacy rate of just 3.2% is not nearly large enough to justify starting production of the vaccine.

Large clinical trial example

Contrast this with the following study results based on a much larger sample of 44,800 subjects:[2]

The Pearson chi-squared statistic (X²) is now 8.375. Thus, the hypothesis of no association can be rejected at the 95% level of confidence.

And the 95% confidence interval for the relative risk ratio is much narrower (i.e., 0.947 to 0.990), indicating a much higher level of precision.[3]

The study investigators now conclude that there is a statistically significant association between receiving the vaccine and avoiding a cold infection (positive outcome).

But the relative risk ratio is identical to that obtained from the smaller study, 0.968.

Implying the efficacy rate is also the same, 3.2%.
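The comparison between the two trials can be sketched as follows. The cell counts for the larger trial are an assumption: each small-trial count multiplied by 350, which gives 128 × 350 = 44,800 subjects and reproduces the reported X² and confidence interval. The log-scale interval and the helper name `summarize` are likewise illustrative choices.

```python
import math

SCALE = 350  # assumed multiplier: 128 subjects * 350 = 44,800 subjects


def summarize(infected_v, n_v, infected_p, n_p, z=1.96):
    """Return (chi-squared, relative risk, CI lower, CI upper) for a two-arm trial."""
    a, b = n_v - infected_v, infected_v  # vaccine: no infection, infection
    c, d = n_p - infected_p, infected_p  # placebo: no infection, infection
    n = n_v + n_p
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    rr = (infected_v / n_v) / (infected_p / n_p)
    se_log_rr = math.sqrt(1 / infected_v - 1 / n_v + 1 / infected_p - 1 / n_p)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return chi2, rr, lower, upper


small = summarize(24, 59, 29, 69)
large = summarize(24 * SCALE, 59 * SCALE, 29 * SCALE, 69 * SCALE)

for label, (chi2, rr, lower, upper) in (("small", small), ("large", large)):
    print(f"{label}: X2 = {chi2:.3f}, RR = {rr:.3f}, "
          f"95% CI = ({lower:.3f}, {upper:.3f})")
```

Only the test statistic and the width of the interval change with the sample size; the point estimate of the relative risk does not.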

Practical vs statistical significance

What are we to make of this?

From the perspective of effect size, do the larger study results carry more weight simply because the hypothesis of no association can be rejected? Even though the practical significance has remained the same?

We can turn a very small, 3.2% effect into a statistically significant effect by simply increasing the sample size.
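As a rough illustration, the sketch below scales the small trial up while holding the cell proportions fixed (a hypothetical exercise, not a real trial design) and stops at the first multiplier where the Pearson chi-squared statistic exceeds 3.841, the 5% critical value for one degree of freedom. The names are illustrative.

```python
CRITICAL_VALUE_5PCT_1DF = 3.841  # chi-squared critical value, 1 df, 5% level


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Small-trial counts: vaccine (no infection, infection), placebo (same order).
base = (35, 24, 40, 29)

# Grow the trial in whole multiples of the original table until significant.
scale = 1
while pearson_chi2_2x2(*(scale * x for x in base)) <= CRITICAL_VALUE_5PCT_1DF:
    scale += 1

subjects_needed = scale * sum(base)
print(f"multiplier = {scale}, subjects = {subjects_needed:,}")
```

Under these assumptions the loop stops at a multiplier of 161, or 20,608 subjects, well short of the 44,800 used in the larger example.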

But does this change the practical significance of the 3.2%?

No.

If 3.2% was deemed by the study investigators to be practically insignificant, it remains practically insignificant. Despite the larger sample size and despite it now being statistically significant.[4]

A curse of data “bigness”

With a large enough sample, almost any nonzero effect becomes statistically significant, even associations that are neither practically significant nor very interesting.

The implication is that, as sample sizes increase, the focus should shift away from hypothesis testing and towards the size of the estimated effect, whether that effect is “practically” important, and “sensitivity analysis” (i.e., how the estimated effect changes when control variables are added and dropped).[5]

Confidence intervals can and should play a role. But they will get narrower and narrower as sample sizes grow. And everything within the confidence interval could still be deemed not practically important.
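One way to put this into practice is to compare the whole confidence interval with a threshold for practical importance, not just with the no-effect value of 1.0. The sketch below uses the large-trial interval from above; the 50% minimum useful efficacy is a purely hypothetical threshold chosen for illustration, and the function name is likewise an assumption.

```python
def classify_effect(rr_lower, rr_upper, min_useful_efficacy_pct):
    """Compare a relative-risk CI with a practical-importance threshold."""
    # Efficacy = (1 - RR) * 100, so the bounds swap when converting.
    efficacy_lower = (1 - rr_upper) * 100
    efficacy_upper = (1 - rr_lower) * 100
    return {
        "efficacy_ci_pct": (efficacy_lower, efficacy_upper),
        # Significant if the interval excludes RR = 1.0 (no difference).
        "statistically_significant": not (rr_lower <= 1.0 <= rr_upper),
        # Clearly unimportant if even the best case misses the threshold.
        "clearly_unimportant": efficacy_upper < min_useful_efficacy_pct,
    }


MIN_USEFUL_EFFICACY_PCT = 50.0  # hypothetical threshold, not from any guideline

result = classify_effect(0.947, 0.990, MIN_USEFUL_EFFICACY_PCT)
low, high = result["efficacy_ci_pct"]
print(f"efficacy CI = ({low:.1f}%, {high:.1f}%)")
print(f"statistically significant: {result['statistically_significant']}")
print(f"clearly unimportant: {result['clearly_unimportant']}")
```

Here the effect is statistically significant, yet every value in the interval sits far below the chosen threshold.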

In sum, as data get bigger (and it does not take massive amounts of data for this to be an issue), we need to guard against concluding that a small effect is practically significant just because the p-value is very small (i.e., the effect is statistically significant).

The curse of big data is still very much with us.

[1] A ratio of 1.0 would mean no difference in effect between the treatment types.

[2] As a point of comparison, the 2020 Moderna and Pfizer COVID-19 vaccine trials consisted of about 30,000 and 40,000 subjects, respectively.

[3] Confidence intervals for actual clinical trial results are calculated with a more complicated technique than the one used here, and it typically results in wider intervals. For example, in 2020, Moderna reported an efficacy rate of 94.1% for its COVID-19 vaccine with a 95% confidence interval of 89.3% to 96.8%.

[4] Since the standard error of the relative risk ratio estimate is based on the cell counts in the contingency table, increasing the size of the sample lowers the standard error, making it more likely we can reject the null hypothesis at a given level of confidence.

[5] The paper Too Big to Fail presents a nice discussion of these issues. Additionally, the American Statistical Association released recommendations on the reporting of p-values.


Efficacy vs Effectiveness of the COVID Vaccines… “tomato, tomahto”?

KDD — Thu, 08 Apr 2021 18:15:48 +0000

You like potato and I like potahto
You like tomato and I like tomahto
Potato, potahto, tomato, tomahto
Let’s call the whole thing off

But oh, if we call the whole thing off
Then we must part
And oh, if we ever part
then that might break my heart

—Ira Gershwin

The eye-popping efficacy rates reported for the Moderna (94%), Pfizer (95%) and, to a lesser extent, the Johnson & Johnson (66%) COVID-19 vaccines have undoubtedly not escaped your attention.

But what is vaccine efficacy and how is it calculated? And how does it differ from vaccine effectiveness?

Moderna vaccine efficacy

First, consider efficacy. Using Moderna’s reported clinical trial results as an example, we see that it is a straightforward calculation.

Moderna reported results from its COVID-19 vaccine trial in November 2020. The results are shown below in a 2×2 “contingency” or “cross-tabulation” table (cell counts as implied by the risks quoted below). The columns show the number of subjects who were infected (or not); the rows show the number who received the vaccine (or the placebo). And the cells show the intersection of those two events.

            Infected    Not infected     Total
Vaccine           11          14,123    14,134
Placebo          185          13,888    14,073
Total            196          28,011    28,207

Relative risk

The strength of the association, or the effect size, between receiving the vaccine and not getting infected is measured by the relative risk.

The probability or risk of a vaccinated subject being infected is 0.08%. That is, (11 / 14,134), or the number of events divided by the sum of events and non-events. For a subject receiving the placebo, the probability of infection is higher at 1.31% (i.e., 185 / 14,073).

So, using the placebo group as the reference group, the relative risk is (11 / 14,134) / (185 / 14,073) or 0.059.[1]

In other words, the risk of a vaccinated person being infected is 94.1% lower than that of a subject who received the placebo (i.e., (1 – 0.059) * 100).

It is this calculation of 94.1% that was reported by Moderna as the vaccine’s efficacy rate.[2]
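The arithmetic behind the 94.1% can be sketched directly from the counts quoted above. The function name and return layout are illustrative.

```python
def vaccine_efficacy(cases_vaccine, n_vaccine, cases_placebo, n_placebo):
    """Return (risk in vaccine arm, risk in placebo arm, relative risk, efficacy %)."""
    risk_vaccine = cases_vaccine / n_vaccine
    risk_placebo = cases_placebo / n_placebo
    relative_risk = risk_vaccine / risk_placebo  # placebo is the reference group
    efficacy_pct = (1 - relative_risk) * 100
    return risk_vaccine, risk_placebo, relative_risk, efficacy_pct


# Counts quoted in the text: 11 of 14,134 vaccinated, 185 of 14,073 placebo.
risk_v, risk_p, rr, efficacy = vaccine_efficacy(11, 14_134, 185, 14_073)
print(f"risk (vaccine) = {risk_v:.2%}, risk (placebo) = {risk_p:.2%}")
print(f"relative risk = {rr:.3f}, efficacy = {efficacy:.1f}%")
```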

Vaccine effectiveness

So, what about vaccine effectiveness? The term effectiveness refers to how the vaccine performs in the real world. Efficacy refers to how the vaccine performs under the “optimal” conditions of a clinical trial.

Clinical trials are based on a sample of subjects who may not be fully representative of the general population (e.g., not all comorbidities are controlled for). In addition, the COVID strain circulating during the clinical trial period may not be the same one circulating when the vaccine is released. Also, vaccine transportation, storage and delivery may differ from the more controlled environment of the clinical trial. Thus, the effectiveness of the vaccine may be different from what was found during the clinical trial.

Studies on COVID vaccine effectiveness

So, do we have any data yet on the real-world effectiveness of the COVID vaccines? It takes time to collect data, but we do have some indication that vaccine effectiveness is very high.

An early study appeared February 24, 2021 in the New England Journal of Medicine. The study examined the performance of the Pfizer vaccine in Israel. The sample was matched data from over 1 million people, half of whom were vaccinated between December 2020 and February 2021 and half of whom were not. The results of the study suggest an effectiveness rate of 94% against symptomatic infection 7 or more days after the second dose.

A more recent study released by the CDC on April 2 examined both the Pfizer and Moderna vaccines. This study used US data from December 2020 to March 2021. The sample consisted of 3,950 health care personnel, first responders, and other front-line workers. The study found that the vaccines were 90% effective against COVID infection 14+ days after the second dose. Even 14+ days after the first dose the vaccines were 80% effective.

As a point of comparison, according to the CDC, effectiveness of the annual flu vaccination ranges between 40 and 60%.[3]

So, the effectiveness rate, after 2 doses of the Pfizer and Moderna vaccines, appears to be very close in magnitude to the efficacy rate.

Very good news indeed!

Tomato, tomahto?

[1] A relative risk ratio of 1.0 would mean no difference in effect between the treatment types.

[2] A summary of efficacy rates across the range of current COVID vaccines can be found here.

[3] One reason for the range is that the flu strain that is in circulation can differ from what was predicted when the annual flu vaccine was developed earlier in the year.
