<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	
	xmlns:georss="http://www.georss.org/georss"
	xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
	>

<channel>
	<title>KDD Analytics</title>
	<atom:link href="https://www.kddanalytics.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.kddanalytics.com/</link>
	<description>Data to Decisions</description>
	<lastBuildDate>Mon, 28 Jun 2021 00:37:12 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.3</generator>

<image>
	<url>https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2016/08/cropped-imageedit_1_7939659602.png?fit=32%2C32&#038;ssl=1</url>
	<title>KDD Analytics</title>
	<link>https://www.kddanalytics.com/</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">114932494</site>	<item>
		<title>If odds are not odd, what about odds ratios?</title>
		<link>https://www.kddanalytics.com/if-odds-are-not-odd-what-about-odds-ratios/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Mon, 28 Jun 2021 00:37:12 +0000</pubDate>
				<category><![CDATA[Categorical Data Analysis]]></category>
		<category><![CDATA[Data Analytics Methods]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[case_control]]></category>
		<category><![CDATA[meta-analysis]]></category>
		<category><![CDATA[odds]]></category>
		<category><![CDATA[odds ratio]]></category>
		<category><![CDATA[prospective]]></category>
		<category><![CDATA[retrospective]]></category>
		<guid isPermaLink="false">https://www.kddanalytics.com/?p=2022</guid>

					<description><![CDATA[<p>What are the odds of developing a brain tumor from long-term use of cell phones? This is an evolving area of research.  Some studies have found an association and others have not. But two recent meta-analyses suggest that the odds are about 33 to 44% greater due to long-term cell phone usage. Got your attention?&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/if-odds-are-not-odd-what-about-odds-ratios/">If odds are not odd, what about odds ratios?</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>What are the <strong>odds of developing a brain tumor from long-term use of cell phones</strong>?</p>
<p>This is an evolving area of research.  Some studies have found an association and others have not.</p>
<p>But two recent <strong><em><a href="https://en.wikipedia.org/wiki/Meta-analysis" target="_blank" rel="noopener">meta-analyses</a></em></strong> suggest that the odds are about <strong>33 to 44%</strong> <strong>greater</strong> due to long-term cell phone usage.</p>
<p>Got your attention?</p>
<p>“But what does this do to my odds of developing a brain tumor?” you may ask.</p>
<p>Before we answer that, we need to explain how the meta-analyses derive this 33 to 44% figure.  Which introduces us to <strong><em>odds ratios</em></strong>.</p>
<h2>Case-control studies</h2>
<p>Studies of the association between cell phone usage and brain tumor are typically <strong><em>case-control</em></strong> studies.</p>
<p>Such studies are <em><strong>retrospective</strong></em>, as opposed to <em><strong>prospective</strong></em>.<a href="#_ftn1" name="_ftnref1">[1]</a> They combine a sample of patients (<strong><em>cases</em></strong>) already diagnosed with a brain tumor with a random sample of non-patients (<strong><em>controls</em></strong>) drawn from the general population. Study investigators match controls to each case based on key demographics such as sex, age, and region.</p>
<p>The studies then measure and test for the existence of an association between <strong><em>exposure</em></strong> (cell phone usage) and <strong><em>outcome</em></strong> (brain tumor).<a href="#_ftn2" name="_ftnref2">[2]</a></p>
<p>Typically, these case-control studies report their <strong>estimated effects</strong>, not in terms of odds, but in terms of <strong>odds ratios</strong>.</p>
<p>So, what is an odds ratio?</p>
<h2>Odds ratios</h2>
<p>An odds ratio is a <strong>measure of association strength.</strong> In this case, between cell phone usage and the diagnosis of a brain tumor.</p>
<p>As an example, we can use the results from one of the <strong><em>high-quality</em></strong> <strong><a href="https://pubmed.ncbi.nlm.nih.gov/16023098/" target="_blank" rel="noopener">studies</a></strong> used in the meta-analyses mentioned above to show how odds ratios are calculated.<a href="#_ftn3" name="_ftnref3">[3]</a></p>
<p>The data shown in the following table are from a case-control study conducted in Sweden between 2000 and 2003.<a href="#_ftn4" name="_ftnref4">[4]</a>  The data are for long-term cell phone usage (&gt;= 10 years). The reference category is no cell phone usage.<a href="#_ftn5" name="_ftnref5">[5]</a></p>
<p><img data-recalc-dims="1" decoding="async" class="alignnone size-full wp-image-2023 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Odds-ratios.png?resize=399%2C120&#038;ssl=1" alt="cell phones and brain tumors" width="399" height="120" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Odds-ratios.png?w=399&amp;ssl=1 399w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Odds-ratios.png?resize=300%2C90&amp;ssl=1 300w" sizes="(max-width: 399px) 100vw, 399px" /><br />
In an earlier <strong><a href="https://www.kddanalytics.com/odds-and-probability-two-sides-of-the-same-coin/">article</a></strong> we learned that the odds of an event occurring are the number of events divided by the number of non-events.</p>
<p>Thus, the <strong>odds of a long-term cell phone user in this sample being diagnosed with a brain tumor</strong> are (16 / 232) or 0.069; about 1 to 14.</p>
<p>The <strong>odds of a non-cell phone user being diagnosed with a brain tumor</strong> are (18 / 674) or 0.027; about 1 to 37.</p>
<p>The <strong>odds ratio is simply the ratio of the two odds</strong>:  (0.069 / 0.027) or 2.582.</p>
<p>So, the odds of a long-term cell phone user being diagnosed with a brain tumor are <strong>2.582 times greater compared to a non-cell phone user</strong>.</p>
<p>Alternatively, this can be stated in <strong>terms of a % difference</strong>. The odds of a long-term cell phone user being diagnosed with a brain tumor are <strong>158% greater compared to a non-cell phone user</strong> ((2.582 – 1) * 100).</p>
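<p>For readers who like to check the arithmetic, the calculation above can be sketched in a few lines of Python (a minimal illustration using the table&#8217;s counts, not code from the study itself):</p>

```python
# Odds ratio from the 2x2 case-control table above:
# 16 exposed cases vs. 232 exposed controls;
# 18 unexposed cases vs. 674 unexposed controls.

def odds_ratio(cases_exposed, controls_exposed, cases_unexposed, controls_unexposed):
    """Ratio of the odds of the outcome among the exposed vs. the unexposed."""
    odds_exposed = cases_exposed / controls_exposed        # 16 / 232, about 0.069
    odds_unexposed = cases_unexposed / controls_unexposed  # 18 / 674, about 0.027
    return odds_exposed / odds_unexposed

ratio = odds_ratio(16, 232, 18, 674)
print(round(ratio, 3))           # 2.582
print(round((ratio - 1) * 100))  # 158 (% greater odds)
```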
<p>That is a pretty large effect.<a href="#_ftn6" name="_ftnref6">[6]</a></p>
<h2>Meta-studies</h2>
<p><strong>Now this is just one study</strong>.  The two meta-studies alluded to above each combined the results of 7 different, high-quality studies.</p>
<p>They found that the overall odds (across the studies) of a long-term cell phone user (&gt;= 10 years) being diagnosed with a brain tumor (any tumor type) are <a href="https://pubmed.ncbi.nlm.nih.gov/28213724/" target="_blank" rel="noopener"><strong>33%</strong></a> and (with respect to <a href="https://www.mayoclinic.org/diseases-conditions/glioma/symptoms-causes/syc-20350251" target="_blank" rel="noopener"><strong>glioma</strong></a>, a common type of tumor) <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5417432/" target="_blank" rel="noopener"><strong>44%</strong></a> <strong>greater compared to a non-cell phone user</strong>.<a href="#_ftn7" name="_ftnref7">[7]</a></p>
<p>These meta-studies found no effect due to cell phone usage over a shorter period (i.e., &lt; 10 years).</p>
<p>So, it appears that the risk, if it exists, is associated with long-term usage.  Moreover, using a cell phone on the same side of the head is associated with <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5417432/" target="_blank" rel="noopener"><strong>46%</strong></a> greater odds of developing a glioma on that side of the head.<a href="#_ftn8" name="_ftnref8">[8]</a></p>
<h2>Odds of developing a brain tumor</h2>
<p>So, <strong>back to our original question</strong>.  What are the odds of developing a brain tumor from long-term cell phone usage?</p>
<p>The odds of developing a brain tumor among the general population are very low to start with.  Annual <strong><a href="https://seer.cancer.gov/statfacts/html/brain.html" target="_blank" rel="noopener">incidence</a></strong> in the US (2018) is 6.5 per 100,000 or 0.0065%.  In terms of odds, this is about 1 to 15,000.</p>
<p>So, a 44% increase in the odds would mean 9.4 per 100,000 or about 1 to 10,000.  Still quite low.<a href="#_ftn9" name="_ftnref9">[9]</a></p>
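<p>The arithmetic behind that figure: convert the baseline rate to odds, scale the odds by the ratio, then convert back to a rate (a minimal sketch; the function name is ours, not from any study):</p>

```python
# Apply an odds ratio to a baseline rate of 6.5 per 100,000.
def rate_after_odds_ratio(rate_per_100k, odds_ratio):
    p = rate_per_100k / 100_000
    odds = p / (1 - p)            # baseline odds, roughly 1 to 15,000
    new_odds = odds * odds_ratio  # an odds ratio scales the odds, not the probability
    return new_odds / (1 + new_odds) * 100_000

print(round(rate_after_odds_ratio(6.5, 1.44), 1))  # 9.4 per 100,000
```

At rates this low the odds and the probability are nearly identical, which is why scaling the rate directly (6.5 &#215; 1.44 &#8776; 9.4) gives essentially the same answer.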
<p>As one <a href="https://academic.oup.com/jnci/article/103/15/1146/2516666" target="_blank" rel="noopener"><strong>researcher</strong></a> put it, “Your chance of being hurt by distracted driving because you’re using your cell phone wipes out the risk of getting cancer.”</p>
<p>However, in 2011 the World Health Organization’s International Agency for Research on Cancer (<a href="https://iarc.who.int/" target="_blank" rel="noopener"><strong>IARC</strong></a>) <strong><u>did</u> classify</strong> cell phones as a Group 2B <strong>carcinogen</strong> (i.e., possibly causes cancer).</p>
<p>And there continues to be a healthy debate in both the statistical and public arenas.</p>
<p><a href="https://ehtrust.org/scientific-documentation-cell-phone-radiation-associated-brain-tumor-rates-rising/" target="_blank" rel="noopener"><strong>Studies</strong></a> continue to be released which purportedly find evidence that recent increases in the incidence of <a href="https://en.wikipedia.org/wiki/Glioblastoma" target="_blank" rel="noopener"><strong>glioblastomas</strong></a>, an aggressive type of brain cancer, are tied to cell phone usage.</p>
<p><a href="https://www.forbes.com/sites/geoffreykabat/2017/12/23/are-brain-cancer-rates-increasing-and-do-changes-relate-to-cell-phone-use/" target="_blank" rel="noopener"><strong>Skeptics</strong></a> argue that changes in the WHO classification of what is considered a glioblastoma may be responsible for any uptick in brain tumor incidence, and that the large increases in risk reported by studies like the meta-studies discussed above are <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4057143/" target="_blank" rel="noopener"><strong>inconsistent</strong></a> with the historical trend in brain tumor incidence.<a href="#_ftn10" name="_ftnref10">[10]</a></p>
<p><strong>As we said at the outset, this is an evolving area of research, with lots of issues to untangle.</strong></p>
<p>One thing to keep in mind, though, is <strong>who is funding the research</strong>.  A topic we will cover in a later article.</p>
<h2>We have odds ratios to thank</h2>
<p><strong>Back to the main point of this article.</strong></p>
<p>Odds facilitate the measurement of the <strong><span style="text-decoration: underline;">relative</span> likelihood of events</strong>.  Epidemiological studies that are retrospective commonly use the <strong>odds ratio as this relative measurement of association strength</strong>.</p>
<p>So, the next time you hear that your favorite dietary choice increases your chances of developing cancer, it is probably the result of that not-so-oddity, the odds ratio.</p>
<p>&nbsp;</p>
<p><a href="#_ftnref1" name="_ftn1">[1]</a> Prospective cohort studies have also been used (i.e., studies which track subjects over time).  See <strong><a href="https://www.cognibrain.com/retrospective-vs-prospective-study-advantages-types-and-differences/" target="_blank" rel="noopener">here</a></strong> for a summary of the advantages and disadvantages of retrospective and prospective studies.</p>
<p><a href="#_ftnref2" name="_ftn2">[2]</a> Exposure is determined by answers to a lengthy questionnaire. Hence, one of the criticisms levied against case-control studies is respondent <strong><a href="https://catalogofbias.org/biases/recall-bias/" target="_blank" rel="noopener">recall bias</a></strong>. That is, whether respondents accurately recall their cell phone usage, particularly over a long period of time.</p>
<p><a href="#_ftnref3" name="_ftn3">[3]</a> Studies are <a href="http://www.ohri.ca/programs/clinical_epidemiology/oxford.asp" target="_blank" rel="noopener"><strong>graded</strong></a> on a quality scale considering such factors as <strong>selection</strong> of cases and controls, <strong>comparability</strong> of cases and controls based on study design, and proper assessment/measurement of <strong>exposure</strong>.</p>
<p><a href="#_ftnref4" name="_ftn4">[4]</a> The results shown in the table are taken from a <a href="https://pubmed.ncbi.nlm.nih.gov/28213724/" target="_blank" rel="noopener"><strong>meta-study</strong></a> which considered this <a href="https://pubmed.ncbi.nlm.nih.gov/16023098/" target="_blank" rel="noopener"><strong>Hardell et al</strong></a> (2006) study.</p>
<p><a href="#_ftnref5" name="_ftn5">[5]</a> As cell phone usage becomes more ubiquitous, and fewer people who have never used a cell phone are available in the population, the exposure will need to be increasingly measured in terms of levels/frequency of usage.</p>
<p><a href="#_ftnref6" name="_ftn6">[6]</a> The additional risk derived using an odds ratio is closely related to the concept of <strong><em>efficacy</em></strong>, which is derived directly from the concept of <strong><em>relative risk</em></strong> (ratio of probabilities). We covered efficacy in an earlier <strong><a href="https://www.kddanalytics.com/covid-vaccine-efficacy-effectiveness/" target="_blank" rel="noopener">article</a></strong>. Epidemiologists typically use relative risk to measure association strength in prospective (cohort) studies; odds ratios in case-control studies.</p>
<p><a href="#_ftnref7" name="_ftn7">[7]</a> Meta-studies start with a larger number of studies.  They then drop studies for various reasons, such as data availability and the quality grade they receive, to arrive at the final sample.</p>
<p><a href="#_ftnref8" name="_ftn8">[8]</a> All these studies on brain tumors controlled for whether cell phones were being used next to users’ heads.</p>
<p><a href="#_ftnref9" name="_ftn9">[9]</a> <strong><a href="https://seer.cancer.gov/statfacts/" target="_blank" rel="noopener">See</a></strong> for US cancer incidence rates as of 2018.</p>
<p><a href="#_ftnref10" name="_ftn10">[10]</a> See also <a href="https://www.forbes.com/sites/geoffreykabat/2017/12/27/what-the-best-u-s-data-have-to-say-about-brain-cancer-rates/" target="_blank" rel="noopener"><strong>Geoffrey Kabat</strong></a> (2017).</p>
<p>&nbsp;</p>
<p>The post <a href="https://www.kddanalytics.com/if-odds-are-not-odd-what-about-odds-ratios/">If odds are not odd, what about odds ratios?</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2022</post-id>	</item>
		<item>
		<title>Odds and probability&#8230;two sides of the same coin</title>
		<link>https://www.kddanalytics.com/odds-and-probability-two-sides-of-the-same-coin/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Fri, 04 Jun 2021 16:59:03 +0000</pubDate>
				<category><![CDATA[Categorical Data Analysis]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Data Analytics Methods]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[odds]]></category>
		<category><![CDATA[probability]]></category>
		<guid isPermaLink="false">https://www.kddanalytics.com/?p=1993</guid>

					<description><![CDATA[<p>What are the lifetime odds of dying from being hit by a meteorite? 1 in 1,600,000. Yep, not very likely.  You are much more likely to die from a dog attack (1 in 86,781) or from a lightning strike (1 in 138,849). But why odds? Why not express these likelihoods in terms of probabilities?  Seems&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/odds-and-probability-two-sides-of-the-same-coin/">Odds and probability&#8230;two sides of the same coin</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>What are the lifetime odds of dying from being hit by a <a href="https://www.tulane.edu/~sanelson/Natural_Disasters/impacts.htm" target="_blank" rel="noopener"><strong>meteorite</strong></a>?</p>
<p>1 in 1,600,000.</p>
<p>Yep, not very likely.  You are much more likely to die from a <a href="https://injuryfacts.nsc.org/all-injuries/preventable-death-overview/odds-of-dying/" target="_blank" rel="noopener"><strong>dog attack</strong></a> (1 in 86,781) or from a <strong><a href="https://injuryfacts.nsc.org/all-injuries/preventable-death-overview/odds-of-dying/" target="_blank" rel="noopener">lightning strike</a></strong> (1 in 138,849).</p>
<p>But why odds?</p>
<p>Why not express these likelihoods in terms of probabilities?  Seems like a more natural way to express the chance of an event occurring, doesn’t it?</p>
<p>Odds, however, are commonly used to express event risk.  And of course, the chances of winning a sporting event.</p>
<p>As we write, the odds of the <strong><a href="https://www.mlb.com/dodgers" target="_blank" rel="noopener">Los Angeles Dodgers</a></strong> repeating as World Series champions in 2021 are 1 in 3.<a href="#_ftn1" name="_ftnref1">[1]</a></p>
<h2>So, what are odds?</h2>
<p style="text-align: center;"><strong>The number of times an event occurs divided by the number of times it does not occur</strong>.</p>
<p>In the case of a meteorite strike, for every person that dies, 1.6 million do not.  In the case of the Dodgers, we would expect the Dodgers to win 1 World Series for every 3 they lose.</p>
<p>But this still begs the question, <strong>why odds and not probabilities?</strong></p>
<h2>Odds and probabilities</h2>
<p>It is true that the probability of a low-likelihood event is so small that stating it as a % requires a lot of zeros after the decimal (0.0000625% in the case of dying from a meteorite strike).</p>
<p>But that is not an insurmountable objection. For example, the risk of disease is often expressed in terms of rates per 100,000 to make the chances of low-likelihood events easier to comprehend. Or we could state the probability of the non-event&#8230;not dying from a meteorite strike (99.9999375%).</p>
<p>A more important reason for using odds is that they facilitate <strong>multiplicative comparisons</strong>.</p>
<p>A simple example makes clear how probabilities can fall short.</p>
<p>Suppose the probability that Beth will go out to dinner this weekend is 75%. We <strong>cannot</strong> say, then, that the probability of Jose doing the same is 3 times Beth&#8217;s.</p>
<p>Why?  Probabilities are constrained to lie between 0 and 1. And 3 * 0.75 &gt; 1.0.</p>
<p>So, what do we do?  Enter odds.</p>
<h3>Odds are unconstrained</h3>
<p>Odds are only bounded on the low end, by 0.  Let&#8217;s return to Beth and Jose.</p>
<p>The odds of Beth going out to dinner are 3 or 3/1.  Why 3/1?</p>
<p>Remember odds are the ratio of the events to non-events. Beth is 75% likely to go out. So, if she is faced with 4 opportunities to go out, she will do so 3 times.  In other words, she will go out (event) 3 times for every time she stays home (non-event).  3 to 1.</p>
<p>Now, if Jose is 3 times as likely to go out as Beth, his odds are simply 3 * 3 or 9.  Equivalently, we can express his odds as 9 to 1 or 9/1.</p>
<p>On the odds scale, odds can be 2, 10, 50 times greater…there is no upper limit. And this makes them very useful when we wish to compare the <strong>relative likelihood</strong> of events occurring.</p>
<h3>Two sides of the same coin</h3>
<p>It turns out that if we are still interested in the probability, we can easily derive it from the odds.  <strong>Odds and probability are two sides of the same coin</strong>.</p>
<p>Odds (o) are related to probability (p) by the following:</p>
<p style="text-align: center;"><em><strong>o = p / (1 &#8211; p) = (probability of event / probability of non-event)</strong></em></p>
<p>Rearranging we find the “other side of the coin” (for an event):</p>
<p style="text-align: center;"><em><strong>p = o / (1 + o) = (odds of event) / (1 + odds of event)</strong></em></p>
<p>So, in the case of Beth and Jose we get:</p>
<p><img data-recalc-dims="1" decoding="async" class="size-full wp-image-2083 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Odds-to-probability.png?resize=429%2C103&#038;ssl=1" alt="odds probability same coin different sides" width="429" height="103" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Odds-to-probability.png?w=429&amp;ssl=1 429w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Odds-to-probability.png?resize=300%2C72&amp;ssl=1 300w" sizes="(max-width: 429px) 100vw, 429px" /></p>
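<p>The two formulas are easy to verify in code (a quick sketch, not part of the original examples):</p>

```python
# Odds <-> probability, per the formulas above.
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (1 + o)

print(prob_to_odds(0.75))  # Beth: 3.0, i.e., 3 to 1
print(odds_to_prob(9))     # Jose: 0.9, i.e., 90%
```

Note that the two functions are inverses of each other: converting a probability to odds and back recovers the original probability.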
<p>The relationship between odds and probability is shown graphically below.</p>
<p><img data-recalc-dims="1" fetchpriority="high" decoding="async" class="aligncenter wp-image-1995 " src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Odds-vs-probability.png?resize=683%2C451&#038;ssl=1" alt="what are odds" width="683" height="451" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Odds-vs-probability.png?w=856&amp;ssl=1 856w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Odds-vs-probability.png?resize=300%2C198&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Odds-vs-probability.png?resize=768%2C507&amp;ssl=1 768w" sizes="(max-width: 683px) 100vw, 683px" /></p>
<p>As the odds increase, the probability also increases but in a non-linear manner.  As shown above, the probability &#8220;increases at a decreasing rate&#8221; and approaches 1.0 “asymptotically” (i.e., as the odds get very large, the probability approaches but never quite reaches 1.0).<a href="#_ftn2" name="_ftnref2">[2]</a></p>
<p>But any finite odds will map to a probability between 0 and 1.</p>
<h2>Odds are preferred</h2>
<p>When comparing the relative chances of events (or sports teams), odds are the preferred way of expressing how much more likely one event is over another.  We can always derive the associated probability.  But since odds are unconstrained, there is no issue with saying the Los Angeles Dodgers are <a href="https://www.oddsshark.com/mlb/world-series-odds" target="_blank" rel="noopener"><strong>11 times</strong></a> as likely (as of May 31, 2021) to win the World Series in 2021 as the Chicago Cubs.</p>
<p>So, the next time someone tells you the odds of rain during your camping trip this weekend are 5 to 2, you might want to sleep in a tent.</p>
<p>&nbsp;</p>
<p><a href="#_ftnref1" name="_ftn1">[1]</a> As of May 31, 2021, the reported <a href="https://www.oddsshark.com/mlb/world-series-odds" target="_blank" rel="noopener"><strong>odds</strong></a> of the Dodgers repeating are 3 to 1 or <a href="https://www.oddsshark.com/tools/odds-calculator" target="_blank" rel="noopener"><strong>3/1</strong></a>.  In the betting world, this is referred to as <a href="https://www.sportsbettingdime.com/guides/betting-101/how-to-read-sports-odds/" target="_blank" rel="noopener"><strong>fractional odds</strong></a>. The <strong>number on the left</strong> or numerator is typically the <strong>number of times</strong> the team is <strong>expected to lose</strong>. 3/1 yields an implied probability of losing 3 times out of 4, or 75%.  Thus, the probability of the Dodgers repeating is (1 – 0.750) or 25%.  Expressed as odds, this is 1/3.</p>
<p><a href="#_ftnref2" name="_ftn2">[2]</a> In the limit, if the odds = infinity, then probability = 1.</p>
<p>&nbsp;</p>
<p>The post <a href="https://www.kddanalytics.com/odds-and-probability-two-sides-of-the-same-coin/">Odds and probability&#8230;two sides of the same coin</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1993</post-id>	</item>
		<item>
		<title>Tableau Basics: SUM vs AVG</title>
		<link>https://www.kddanalytics.com/are-you-using-the-correct-tableau-aggregation-sum-vs-avg/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Fri, 21 May 2021 17:49:37 +0000</pubDate>
				<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Tableau]]></category>
		<category><![CDATA[data visualization]]></category>
		<guid isPermaLink="false">https://www.kddanalytics.com/?p=2086</guid>

					<description><![CDATA[<p>First time users of Tableau often get tripped up over the default Tableau SUM aggregation.  Here is what I mean. Suppose the question is to find the average of SALES PER VISIT (sales measured across the preceding 6 months) among the males and females in a sample of 25 shoppers.  The data look like this&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/are-you-using-the-correct-tableau-aggregation-sum-vs-avg/">Tableau Basics: SUM vs AVG</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>First time users of <strong><a href="https://www.tableau.com/" target="_blank" rel="noopener">Tableau</a></strong> often get tripped up over the default Tableau SUM aggregation.  Here is what I mean.</p>
<p>Suppose the question is to find the average of SALES PER VISIT (sales measured across the preceding 6 months) among the males and females in a sample of 25 shoppers.  The data look like this in Excel:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-2087 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-2.png?resize=330%2C651&#038;ssl=1" alt="Tableau and Excel" width="330" height="651" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-2.png?w=330&amp;ssl=1 330w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-2.png?resize=152%2C300&amp;ssl=1 152w" sizes="auto, (max-width: 330px) 100vw, 330px" /></p>
<p><strong>TIP:</strong>  We can easily input these data to Tableau by <strong><a href="https://www.thedataschool.co.uk/borja-leiva/tableau-tuesday-tip-paste-data-clipboard" target="_blank" rel="noopener">cutting and pasting</a></strong> the selection into the Tableau canvas:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone wp-image-2088 size-large" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-1.png?resize=1024%2C668&#038;ssl=1" alt="Cut and paste data into Tableau" width="1024" height="668" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-1.png?resize=1024%2C668&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-1.png?resize=300%2C196&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-1.png?resize=768%2C501&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-1.png?w=1441&amp;ssl=1 1441w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>Now to answer the question.</p>
<p>First time users of Tableau may correctly put GENDER on the row shelf and SALES PER VISIT on the column shelf.  Tableau defaults to a bar chart yielding:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-2091" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-3.png?resize=1024%2C651&#038;ssl=1" alt="Tableau bar chart" width="1024" height="651" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-3.png?resize=1024%2C651&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-3.png?resize=300%2C191&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-3.png?resize=768%2C489&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-3.png?w=1440&amp;ssl=1 1440w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>First time users may also put SALES PER VISIT on the Label Marks <a href="https://help.tableau.com/current/pro/desktop/en-us/buildmanual_shelves.htm" target="_blank" rel="noopener"><strong>card</strong></a>, which displays the value next to each bar in the chart.  They may even put GENDER on the Color Marks card to give the viz some pop.</p>
<p>And then they call it done.  Males spend more on average than females.</p>
<p>But do they?</p>
<p>We note that the green pills show SUM(Sales per Visit).  Tableau’s default <strong>“aggregation”</strong> is to sum the values across the rows in the data set.</p>
<p>Going back to Excel, if we sum SALES PER VISIT by GENDER across the 25 rows, we get, using the <strong><a href="https://www.excel-easy.com/examples/sumif.html" target="_blank" rel="noopener">SUMIF</a></strong> function:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-2092 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-7.png?resize=274%2C95&#038;ssl=1" alt="Excel SUMIF and AVERAGEIF" width="274" height="95" /></p>
<p>This is exactly what Tableau shows.</p>
<p><strong>But we want to find the average.</strong>  In Excel, using the <strong><a href="https://www.excel-easy.com/examples/averageif.html" target="_blank" rel="noopener">AVERAGEIF</a></strong> function, we see that females spend on average $11.62 while males spend $9.13 per visit.</p>
<p>To get Tableau to match, we simply <strong>change the aggregation</strong> by right-clicking on <strong>each</strong> of the green pills and select Measure (Average) from the drop-down menu.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-2093" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-5.png?resize=1024%2C649&#038;ssl=1" alt="Tableau change aggregation" width="1024" height="649" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-5.png?resize=1024%2C649&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-5.png?resize=300%2C190&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-5.png?resize=768%2C487&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-5.png?w=1443&amp;ssl=1 1443w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>Now we get the correct answer to our question.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-2094" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-6.png?resize=1024%2C648&#038;ssl=1" alt="Tableau correct aggregation" width="1024" height="648" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-6.png?resize=1024%2C648&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-6.png?resize=300%2C190&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-6.png?resize=768%2C486&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/05/Tableau-SUM-vs-AVG-6.png?w=1442&amp;ssl=1 1442w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>If Tableau is not yielding the correct answer, try thinking about how you would do it in Excel.  Often, though not always, this will point you toward the proper aggregation.</p>
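<p>The same SUM-vs-AVERAGE distinction shows up in any groupby tool. Below is a minimal pandas sketch; the column names and visit-level values are made-up assumptions, chosen so the group averages match the $11.62 and $9.13 figures above:</p>

```python
import pandas as pd

# Made-up visit-level data; column names and values are assumptions,
# chosen so the averages match the $11.62 and $9.13 in the post
visits = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M"],
    "Spend":  [12.00, 11.24, 9.00, 9.50, 8.89],
})

# Tableau's default SUM aggregation: total spend per gender
totals = visits.groupby("Gender")["Spend"].sum()

# Measure (Average), the equivalent of Excel's AVERAGEIF
averages = visits.groupby("Gender")["Spend"].mean()  # F: 11.62, M: 9.13
```

<p>Tableau&#8217;s default SUM corresponds to <code>.sum()</code>; switching the pill to Average corresponds to <code>.mean()</code>, just as AVERAGEIF does in Excel.</p>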
<p>The post <a href="https://www.kddanalytics.com/are-you-using-the-correct-tableau-aggregation-sum-vs-avg/">Tableau Basics: SUM vs AVG</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2086</post-id>	</item>
		<item>
		<title>Curse of Big Data</title>
		<link>https://www.kddanalytics.com/curse-of-big-data/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Mon, 03 May 2021 11:31:59 +0000</pubDate>
				<category><![CDATA[Categorical Data Analysis]]></category>
		<category><![CDATA[Data Analytics Methods]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[efficacy]]></category>
		<category><![CDATA[hypothesis testing]]></category>
		<category><![CDATA[practical significance]]></category>
		<category><![CDATA[statistical significance]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.kddanalytics.com/?p=1961</guid>

					<description><![CDATA[<p>“Big data.” We checked in with Google search trends recently. Appears that “Big Data” has lost its luster search-wise…started trending down about 4 years ago. Nowadays, everything is big data? Implications of big data However, this does not mean we should lose sight of certain statistical implications associated with being “big”. Yes, large amounts of&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/curse-of-big-data/">Curse of Big Data</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>“Big data.”</p>
<p>We checked in with <strong><a href="https://trends.google.com/trends/?geo=US" target="_blank" rel="noopener">Google</a></strong> search trends recently. It appears that “Big Data” has lost its luster search-wise, trending downward for about the last four years.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-1963" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Big-Data-Trend.png?resize=1024%2C673&#038;ssl=1" alt="curse of big data" width="1024" height="673" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Big-Data-Trend.png?resize=1024%2C673&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Big-Data-Trend.png?resize=300%2C197&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Big-Data-Trend.png?resize=768%2C505&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Big-Data-Trend.png?w=1203&amp;ssl=1 1203w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>Nowadays, is everything big data?</p>
<h2>Implications of big data</h2>
<p>However, this does not mean we should lose sight of certain <strong>statistical implications</strong> associated with being “big”. Yes, large amounts of data can help us estimate relationships (<strong>effects</strong>) with a high degree of precision.</p>
<p>And help us uncover low occurrence events such as the blood clotting cases associated with the <strong><a href="https://www.nytimes.com/2021/04/16/health/johnson-vaccine-blood-clot-case.html" target="_blank" rel="noopener">Johnson &amp; Johnson</a></strong> COVID-19 vaccine.</p>
<p>But massive amounts of data can reveal patterns that are not always meaningful or happen by <a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/the-curse-of-big-data" target="_blank" rel="noopener"><strong>chance</strong></a>.</p>
<p>Additionally, from a <strong><a href="https://en.wikipedia.org/wiki/Statistical_inference" target="_blank" rel="noopener">statistical inference </a></strong>perspective, with big data, <strong>even small, uninteresting effects can be statistically significant</strong>.</p>
<p>This has important implications for inferential conclusions about the associations we are studying.</p>
<p>And it does not take all that much data for this to happen.</p>
<h3>Small clinical trial example</h3>
<p>As an example, consider the following hypothetical results from a clinical trial of a &#8220;common&#8221; cold vaccine:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1969 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Cold-Vaccine-Trial-Small-Sample.png?resize=391%2C159&#038;ssl=1" alt="curse of big data" width="391" height="159" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Cold-Vaccine-Trial-Small-Sample.png?w=391&amp;ssl=1 391w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Cold-Vaccine-Trial-Small-Sample.png?resize=300%2C122&amp;ssl=1 300w" sizes="auto, (max-width: 391px) 100vw, 391px" /></p>
<p>The table shows the number of subjects who had both a positive outcome (no infection) and negative outcome (infection) across the two types of treatment. A standard statistical test of association, the <a href="https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test" target="_blank" rel="noopener"><strong>Pearson chi-squared</strong></a>, indicates <strong>we cannot say there is any difference in outcomes</strong> across the two treatment types.</p>
<p>That is, we <strong>cannot reject the &#8220;null&#8221; hypothesis of no association</strong> at the 95% level of confidence (i.e., <em>X<sup>2 </sup>=</em> 0.024).</p>
<p>The <strong>strength of the association</strong>, or <strong><em>effect size,</em></strong> is obtained from the ratio of the two groups&#8217; risks, known as the <a href="https://www.cdc.gov/csels/dsepd/ss1978/lesson3/section5.html" target="_blank" rel="noopener"><strong>relative risk</strong></a>.</p>
<p>The probability of a vaccinated subject getting sick is (24 / 59) or 0.407 (40.7%) while that for the placebo group is (29 / 69) or 0.420 (42.0%).</p>
<p>So the relative risk ratio is (0.407 / 0.420) or 0.968.<a href="#_ftn1" name="_ftnref1">[1]</a></p>
<p>Thus, we would expect that when applied to the population, <strong><span style="text-decoration: underline;">under the same conditions as the study</span></strong>, there would be 3.2% fewer infections among those who received the vaccine (i.e., (1 &#8211; 0.968) * 100).</p>
<p>This 3.2% is known as the <a href="https://www.kddanalytics.com/covid-vaccine-efficacy-effectiveness/" target="_blank" rel="noopener"><em><strong>efficacy rate</strong></em></a> of the vaccine.</p>
<p>The 95% confidence interval for the relative risk ratio is wide (i.e., 0.639 to 1.465) indicating a lack of precision in the <strong><em>point estimate</em></strong> of 0.968.</p>
<p>The study investigators conclude that the effect of the vaccine is <strong>neither <span style="text-decoration: underline;">statistically</span> nor <a href="https://statisticsbyjim.com/hypothesis-testing/practical-statistical-significance/" target="_blank" rel="noopener"><em>practically</em></a> significant</strong>.</p>
<p>Aside from its statistical insignificance, an efficacy rate of just 3.2% is not nearly large enough to justify starting production of the vaccine.</p>
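<p>The chi-squared statistic, relative risk ratio, and confidence interval above can all be reproduced in a few lines. Here is a sketch using SciPy, assuming the cell counts implied by the table (24 of 59 vaccinated and 29 of 69 placebo subjects infected):</p>

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Small-trial table from the post: rows = vaccine, placebo;
# columns = infected, not infected
table = np.array([[24, 35],
                  [29, 40]])

# Pearson chi-squared, no continuity correction
chi2 = chi2_contingency(table, correction=False)[0]   # ~0.024

# Relative risk ratio (placebo group as the reference)
rr = (24 / 59) / (29 / 69)                            # ~0.968
efficacy = (1 - rr) * 100                             # ~3.2%

# 95% CI via the usual large-sample formula on the log scale
se = np.sqrt(1/24 - 1/59 + 1/29 - 1/69)
z = norm.ppf(0.975)
lo, hi = np.exp(np.log(rr) - z * se), np.exp(np.log(rr) + z * se)  # ~0.639, ~1.465
```

<p>All three reported figures (0.024, 0.968, and the 0.639&#8211;1.465 interval) come back out, which is a useful sanity check on the table.</p>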
<h3>Large clinical trial example</h3>
<p>Contrast this with the following study results based on a much larger sample of 44,800 subjects:<a href="#_ftn2" name="_ftnref2">[2]</a></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1970 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Cold-Vaccine-Trial-Large-Sample.png?resize=398%2C155&#038;ssl=1" alt="curse of big data" width="398" height="155" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Cold-Vaccine-Trial-Large-Sample.png?w=398&amp;ssl=1 398w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Cold-Vaccine-Trial-Large-Sample.png?resize=300%2C117&amp;ssl=1 300w" sizes="auto, (max-width: 398px) 100vw, 398px" /></p>
<p>The Pearson chi-squared statistic (<em>X<sup>2</sup></em>) is now 8.375. Thus, the <strong>hypothesis of no association <u>can be rejected</u> </strong>at the 95% level of confidence.</p>
<p>And the <strong>95% confidence interval </strong>for the relative risk ratio is<strong> much narrower indicating a much higher level of precision</strong> (i.e., 0.947 to 0.990).<a href="#_ftn3" name="_ftnref3">[3]</a></p>
<p>The study investigators now conclude that there is a <strong>statistically significant</strong> association between receiving the vaccine and avoiding a cold infection (positive outcome).</p>
<p><strong>But, </strong>the <strong>relative risk ratio</strong> of a positive outcome from receiving the vaccine is<strong> identical </strong>to that obtained from the smaller study,<strong> 0.968. </strong></p>
<p>Implying the <strong>efficacy rate is also the same, 3.2%</strong>.</p>
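<p>To see this directly, note that multiplying every cell count by a constant leaves the relative risk ratio unchanged while scaling the chi-squared statistic by that same constant. A sketch, under the assumption that the 44,800-subject table is simply the small table scaled by 44,800 / 128 = 350 (an assumption that does reproduce the reported figures):</p>

```python
import numpy as np
from scipy.stats import chi2_contingency

small = np.array([[24, 35],
                  [29, 40]])    # n = 128

# Assumption: the 44,800-subject table is the small table scaled by 350
large = small * 350

for table in (small, large):
    chi2 = chi2_contingency(table, correction=False)[0]
    risk = table[:, 0] / table.sum(axis=1)  # per-row infection risk
    rr = risk[0] / risk[1]
    print(f"n={table.sum():>6}  chi2={chi2:.3f}  rr={rr:.3f}")

# chi-squared jumps from 0.024 to 8.375; rr stays at 0.968
```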
<h2>Practical vs statistical significance</h2>
<p>What are we to make of this?</p>
<p>From the perspective of <strong>effect size</strong>, do the larger study results carry more weight <strong>simply because</strong> the hypothesis of no association can be rejected? Even though the <strong><em>practical </em>significance has remained the same</strong>?</p>
<p>We can turn a very small, 3.2% effect into a <span style="text-decoration: underline;"><strong>statistically</strong></span> significant effect by simply increasing the sample size.</p>
<p>But does this <strong>change</strong> the <strong><span style="text-decoration: underline;">practical</span> </strong>significance of the 3.2%?</p>
<p><strong>No.</strong></p>
<p>If 3.2% was deemed by the study investigators to be <strong>practically insignificant</strong>, it<strong> remains practically insignificant.</strong> Despite the larger sample size and despite it now being statistically significant.<a href="#_ftn4" name="_ftnref4">[4]</a></p>
<h2>A curse of data &#8220;bigness&#8221;</h2>
<p style="text-align: center;"><strong>With a large enough sample, everything is statistically significant, <span style="text-decoration: underline;">even associations that are neither practically significant nor particularly interesting</span>.</strong></p>
<p>The implication is that rather than focusing on hypothesis testing as sample sizes increase, the focus should <strong>shift toward</strong> the<strong> size of the estimated effect</strong>, whether the<strong> estimated effect is “practically” important,</strong> and <strong>“sensitivity analysis”</strong> (i.e., how the estimated effect changes when <em><strong>control variables</strong></em> are added and dropped).<a href="#_ftn5" name="_ftnref5">[5]</a></p>
<p><strong>Confidence intervals</strong> can and should play a role. But they will get narrower and narrower as sample sizes grow. And everything within the confidence interval could still be deemed not practically important.</p>
<p>In sum, <strong>as data get bigger</strong> (and it does not take massive amounts of data for this to be an issue), <strong>we need to guard against concluding that a small effect is <span style="text-decoration: underline;">practically</span> significant just because the <a href="https://statisticsbyjim.com/hypothesis-testing/interpreting-p-values/" target="_blank" rel="noopener">p-value</a> is very small</strong> (i.e., the effect is statistically significant).</p>
<p><strong>The curse of big data is still very much with us.</strong></p>
<p>&nbsp;</p>
<p><a href="#_ftnref1" name="_ftn1">[1]</a> A ratio of 1.0 would mean no difference in effect between the treatment types.</p>
<p><a href="#_ftnref2" name="_ftn2">[2]</a> As a point of comparison, the 2020 Moderna and Pfizer COVID-19 vaccine trials consisted of about 30,000 and 40,000 subjects, respectively.</p>
<p><a href="#_ftnref3" name="_ftn3">[3]</a> Confidence intervals for actual clinical trial results are calculated with a more complicated technique than the one used here, which typically yields wider intervals.  For example, in 2020, Moderna <a href="https://www.modernatx.com/covid19vaccine-eua/providers/clinical-trial-data" target="_blank" rel="noopener"><strong>reported</strong></a> an efficacy rate of 94.1% for its COVID-19 vaccine with a 95% confidence interval of 89.3% to 96.8%.</p>
<p><a href="#_ftnref4" name="_ftn4">[4]</a> Since the standard error of the relative risk ratio estimate is based on the cell counts in the <a href="https://www.kddanalytics.com/covid-vaccine-efficacy-effectiveness/" target="_blank" rel="noopener"><strong>contingency table</strong></a>, increasing the size of the sample lowers the standard error, making it more likely we can reject the null hypothesis at a given level of confidence.</p>
<p><a href="#_ftnref5" name="_ftn5">[5]</a> The paper <a href="https://www.galitshmueli.com/system/files/Print%20Version.pdf" target="_blank" rel="noopener"><strong>Too Big to Fail</strong></a> presents a nice discussion of these issues. Additionally, the American Statistical Association released <strong><a href="https://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108" target="_blank" rel="noopener">recommendations</a> </strong>on the reporting of p-values.</p>
<p>&nbsp;</p>
<p>The post <a href="https://www.kddanalytics.com/curse-of-big-data/">Curse of Big Data</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1961</post-id>	</item>
		<item>
		<title>San Diego and COVID-19 &#8230; A Very Challenging Year</title>
		<link>https://www.kddanalytics.com/san-diego-and-covid-19-a-very-challenging-year/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Fri, 23 Apr 2021 01:59:25 +0000</pubDate>
				<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Tableau]]></category>
		<category><![CDATA[COVID]]></category>
		<category><![CDATA[data visualization]]></category>
		<guid isPermaLink="false">https://www.kddanalytics.com/?p=2048</guid>

					<description><![CDATA[<p>We just noticed that it has been a full year since we started posting daily updates to our San Diego County COVID-19 dashboard. This dashboard tracks the San Diego COVID experience: new cases, tests, and positivity rates at the county-level as well as new cases for each of the county’s ZIP Codes. On this first-year&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/san-diego-and-covid-19-a-very-challenging-year/">San Diego and COVID-19 &#8230; A Very Challenging Year</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>We just noticed that it has been a full year since we started posting daily updates to our San Diego County COVID-19 <strong><a href="https://public.tableau.com/profile/kdd.analytics#!/vizhome/SanDiegoCountyCOVID-19/SanDiegoCOVID-19">dashboard</a></strong>.</p>
<p>This dashboard tracks the San Diego COVID experience: new cases, tests, and positivity rates at the <strong>county-level</strong> as well as new cases for each of the county’s <strong>ZIP Codes</strong>.</p>
<p>On this first-year anniversary of these daily postings, we thought we would look back at this roller coaster year.</p>
<p>And, although our dashboard does not include US data, we thought that comparing the San Diego COVID experience with the national average would be insightful.</p>
<h2>San Diego COVID experience vs the nation</h2>
<p>The following figure shows the 7-day moving average of daily new cases per 100,000, for both San Diego County and the entire US.<a href="#_ftn1" name="_ftnref1">[1]</a></p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-2049" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/San-Diego-vs-US-COVID-New-Cases.png?resize=1024%2C616&#038;ssl=1" alt="San Diego COVI-19" width="1024" height="616" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/San-Diego-vs-US-COVID-New-Cases.png?resize=1024%2C616&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/San-Diego-vs-US-COVID-New-Cases.png?resize=300%2C180&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/San-Diego-vs-US-COVID-New-Cases.png?resize=768%2C462&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/San-Diego-vs-US-COVID-New-Cases.png?w=1482&amp;ssl=1 1482w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>As shown in the above figure, as a nation we have been through 3 waves with it being too soon to tell if the 4<sup>th</sup> wave has crested. San Diego County’s experience was generally similar <strong>except for the 4<sup>th</sup> wave</strong>.</p>
<h3>Wave #1</h3>
<p>The initial rise in daily new cases crested at a 7-day average of 10 per 100,000 for the US on April 12, 2020. San Diego’s first wave crested about a week earlier on April 4<sup>th</sup> at about 4 per 100,000.</p>
<p>The US new case 7-day average fell to 6 per 100,000 by mid-June. San Diego’s briefly fell a bit but then rose back up to a daily rate of 3 to 4 per 100,000 till mid-June.</p>
<p>So, San Diego did not really experience the same recovery from the first wave as the US.</p>
<h3>Wave #2</h3>
<p>For both the US and San Diego, the second, much larger wave began in mid-June 2020. The US new case rate increased from a 7-day average of about 6 per 100,000 to a peak of 21 per 100,000 on July 23<sup>rd</sup>. San Diego’s rate increased from about 4 per 100,000 to a peak of 16 per 100,000 on July 2<sup>nd</sup>.</p>
<p>Both the US and San Diego new case rates declined through the end of the summer. The US new case rate bottomed at a 7-day average of 13 per 100,000 on September 13<sup>th</sup>. San Diego’s bottomed at the end of August at about 8 per 100,000 and stayed essentially flat till mid-October.</p>
<h3>Wave #3</h3>
<p>The US third wave began in mid-September – a full month before San Diego was hit. Rising from a 7-day average low of about 11 per 100,000, the US new case rate increased throughout the fall and early winter, peaking at 76 cases per 100,000 on January 11, 2021.</p>
<p>San Diego’s third wave began in mid-October. From a 7-day average low of about 8 new cases per 100,000 on October 20, <strong>the new case rate peaked at a high of 109 per 100,000</strong> on the same day that this third wave peaked for the country.</p>
<p>As the figure shows, San Diego (as well as Los Angeles) suffered much higher new case rates than the national average.</p>
<p>But daily new cases started to decline just as steeply as they increased. San Diego’s 7-day average new case rate fell from this high of 109 to around 7 per 100,000 by April 20, 2021.</p>
<p>The US new case rate fell as well, from a 7-day average of 76 to about 16 by March 19, 2021.</p>
<h3>Wave #4</h3>
<p>Until this point, San Diego’s experience, though different in severity, matched the general pattern of the country. However, a 4<sup>th</sup> US wave began in mid-March 2021, driven by new outbreaks in Michigan and New Jersey. It is too soon to tell if this 4<sup>th</sup> wave has crested but the most recent peak is a 7-day average of 21 new cases per 100,000 on April 13<sup>th</sup>.</p>
<p>San Diego has been fortunate to escape this 4<sup>th</sup> wave (so far).</p>
<p>Fingers crossed&#8230;</p>
<p>&nbsp;</p>
<p><a href="#_ftnref1" name="_ftn1">[1]</a> San Diego new case data are from the San Diego County <a href="https://www.sandiegocounty.gov/content/sdc/hhsa/programs/phs/community_epidemiology/dc/2019-nCoV/status.html"><strong>Health Department</strong></a>. US new case data are from the <a href="https://covid.cdc.gov/covid-data-tracker/#trends_dailytrendscases"><strong>CDC</strong></a>.  The 7-day moving average is the average of the current and preceding 6 days. <a href="https://www.census.gov/quickfacts/fact/table/US,sandiegocountycalifornia,CA/PST045219"><strong>2019 population</strong></a> is used to normalize case counts so we can compare San Diego with the nation.</p>
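<p>The 7-day moving average defined in the footnote (the current day plus the preceding 6) is straightforward to compute with pandas; the daily counts below are made-up placeholders, not actual San Diego data:</p>

```python
import pandas as pd

# Hypothetical daily new-case rates per 100,000 (placeholders)
daily = pd.Series([5, 7, 6, 9, 12, 10, 14, 20, 18],
                  index=pd.date_range("2020-04-01", periods=9))

# 7-day moving average: the current day and the preceding 6 days,
# exactly as defined in the footnote
smoothed = daily.rolling(window=7).mean()
```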
<p>The post <a href="https://www.kddanalytics.com/san-diego-and-covid-19-a-very-challenging-year/">San Diego and COVID-19 &#8230; A Very Challenging Year</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2048</post-id>	</item>
		<item>
		<title>Efficacy vs Effectiveness of the COVID Vaccines…&#8221;tomato, tomahto&#8221;?</title>
		<link>https://www.kddanalytics.com/covid-vaccine-efficacy-effectiveness/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Thu, 08 Apr 2021 18:15:48 +0000</pubDate>
				<category><![CDATA[Categorical Data Analysis]]></category>
		<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Data Analytics Methods]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[categorical data]]></category>
		<category><![CDATA[contingency table]]></category>
		<category><![CDATA[COVID]]></category>
		<category><![CDATA[efficacy]]></category>
		<category><![CDATA[relative risk]]></category>
		<category><![CDATA[statistics]]></category>
		<guid isPermaLink="false">https://www.kddanalytics.com/?p=1939</guid>

					<description><![CDATA[<p>You like potato and I like potahto You like tomato and I like tomahto Potato, potahto, tomato, tomahto Let&#8217;s call the whole thing off But oh, if we call the whole thing off Then we must part And oh, if we ever part then that might break my heart &#8212;Ira Gershwin The eye-popping efficacy rates&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/covid-vaccine-efficacy-effectiveness/">Efficacy vs Effectiveness of the COVID Vaccines…&#8221;tomato, tomahto&#8221;?</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p style="text-align: center;"><em>You like potato and I like potahto</em><br />
<em>You like tomato and I like tomahto</em><br />
<em>Potato, potahto, tomato, tomahto</em><br />
<em>Let&#8217;s call the whole thing off</em></p>
<p style="text-align: center;"><em>But oh, if we call the whole thing off</em><br />
<em>Then we must part</em><br />
<em>And oh, if we ever part</em><br />
<em>then that might break my heart</em></p>
<p style="text-align: center;"><em>&#8212;Ira Gershwin</em></p>
<p>The eye-popping efficacy rates reported for the Moderna (<a href="https://www.cdc.gov/coronavirus/2019-ncov/vaccines/different-vaccines/Moderna.html"><strong>94%</strong></a>), Pfizer (<a href="https://www.cdc.gov/coronavirus/2019-ncov/vaccines/different-vaccines/Pfizer-BioNTech.html"><strong>95%</strong></a>) and, to a lesser extent, the Johnson &amp; Johnson (<a href="https://www.cdc.gov/coronavirus/2019-ncov/vaccines/different-vaccines/janssen.html"><strong>66%</strong></a>) COVID-19 vaccines have undoubtedly not escaped your attention.</p>
<p>But what is vaccine <em><strong>efficacy</strong></em> and how is it calculated? And how does it differ from vaccine <em><strong>effectiveness</strong></em>?</p>
<h2>Moderna vaccine efficacy</h2>
<p>First, consider efficacy. Using Moderna’s reported clinical trial results as an example, we see that it is a straightforward calculation.</p>
<p>Moderna <strong><a href="https://www.modernatx.com/covid19vaccine-eua/providers/clinical-trial-data">reported</a></strong> results from its COVID-19 vaccine trial in November 2020. The results are shown below in a 2&#215;2 “<a href="https://en.wikipedia.org/wiki/Contingency_table"><strong><em>contingency</em></strong></a>” or “<strong><em>cross-tabulation</em></strong>” table. The columns show the number of subjects who were infected (or not); the rows show the number who received the vaccine (or the placebo). And the cells show the intersection of those two events.</p>
<h4><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1954 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Moderna-COVID-Clinical-Trial-Contingency-Table-v2.png?resize=458%2C164&#038;ssl=1" alt="Efficacy of Moderna COVID vaccine" width="458" height="164" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Moderna-COVID-Clinical-Trial-Contingency-Table-v2.png?w=458&amp;ssl=1 458w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2021/04/Moderna-COVID-Clinical-Trial-Contingency-Table-v2.png?resize=300%2C107&amp;ssl=1 300w" sizes="auto, (max-width: 458px) 100vw, 458px" /></h4>
<h3>Relative risk</h3>
<p>The <strong>strength of the association, </strong>or the<strong><em> effect size</em>,</strong> between receiving the vaccine and not getting infected is measured by the <em><strong>relative risk</strong></em>.</p>
<p>The <em><strong>probability</strong></em> or <em><strong>risk</strong></em> of a vaccinated subject being infected is 0.08%. That is, (11 / 14,134): the number of infections divided by the total number of vaccinated subjects (events plus non-events). For a subject receiving the placebo, the probability of infection is higher at 1.31% (i.e., 185 / 14,073).</p>
<p>So, using the placebo group as the reference group, the <em><strong>relative risk</strong></em> is (11 / 14,134) / (185 / 14,073) or 0.059.<a href="#_ftn1" name="_ftnref1">[1]</a></p>
<p>In other words, <strong>the risk of a vaccinated person being infected is 94.1% <span style="text-decoration: underline;">lower</span> compared to a subject who received the placebo</strong> (i.e., (1 – 0.059) * 100).</p>
<p>It is this calculation of 94.1% that was reported by Moderna as the vaccine&#8217;s <strong><em><a href="https://www.cdc.gov/csels/dsepd/ss1978/lesson3/section6.html">efficacy rate</a></em>.</strong><a href="#_ftn2" name="_ftnref2">[2]</a></p>
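<p>The arithmetic behind the 94.1% figure is simple enough to check in a few lines of Python, using the cell counts from the table above:</p>

```python
# Cell counts from Moderna's reported 2x2 table above
vax_infected, vax_total = 11, 14_134
placebo_infected, placebo_total = 185, 14_073

risk_vax = vax_infected / vax_total              # ~0.0008 (0.08%)
risk_placebo = placebo_infected / placebo_total  # ~0.0131 (1.31%)

relative_risk = risk_vax / risk_placebo          # ~0.059
efficacy = (1 - relative_risk) * 100             # ~94.1%
```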
<h2>Vaccine effectiveness</h2>
<p>So, what about <em><strong>vaccine effectiveness</strong></em>? The term effectiveness refers to <strong>how the vaccine performs in the real world</strong>.  Efficacy refers to how the vaccine performs under the “optimal” conditions of a clinical trial.</p>
<p>Clinical trials are based on a sample of subjects who may not be fully representative of the general population (e.g., all <a href="https://www.verywellhealth.com/comorbidity-5081615"><strong>comorbidities</strong></a> are not controlled for). In addition, the COVID strain that existed in the population during the clinical trial period may not be the same that occurs when the vaccine is released. Also, vaccine transportation, storage and delivery may differ from the more controlled environment of the clinical trial. Thus, the effectiveness of the vaccine may be different from what was found during the clinical trial.</p>
<h3>Studies on COVID vaccine effectiveness</h3>
<p>So, do we have any data yet on the real-world effectiveness of the COVID vaccines? It takes time to collect data, but <strong>we do have some indication that vaccine effectiveness is very high.</strong></p>
<p>An early <a href="https://www.nejm.org/doi/full/10.1056/NEJMoa2101765"><strong>study</strong></a> appeared February 24, 2021 in the New England Journal of Medicine.  The study examined the Pfizer vaccine&#8217;s performance in Israel. The sample consisted of matched data from over 1 million people, half of whom were vaccinated between December 2020 and February 2021 and half of whom were not. The results of the study suggest a <strong>symptomatic infection effectiveness rate of 94%</strong> 7+ days after the second dose.</p>
<p>A more recent <a href="https://www.cdc.gov/mmwr/volumes/70/wr/mm7013e3.htm"><strong>study</strong></a> released by the CDC on April 2 examined both the Pfizer and Moderna vaccines.  This study used US data from December 2020 to March 2021. The sample consisted of 3,950 health care personnel, first responders, and other front-line workers.  The study found that the <strong>vaccines were 90% effective against COVID infection</strong> 14+ days after the second dose. <strong>Even 14+ days after the <span style="text-decoration: underline;">first</span> dose the vaccines were 80% effective.</strong></p>
<p>As a point of comparison, according to the <a href="https://www.cdc.gov/flu/vaccines-work/vaccineeffect.htm"><strong>CDC</strong></a>, effectiveness of the annual flu vaccination ranges between 40 and 60%.<a href="#_ftn3" name="_ftnref3">[3]</a></p>
<p><strong>So, the effectiveness rate, after 2 doses of the Pfizer and Moderna vaccines, appears to be very close in magnitude to the efficacy rate.</strong></p>
<p>Very good news indeed!</p>
<p>Tomato, tomahto?</p>
<p>&nbsp;</p>
<p><a href="#_ftnref1" name="_ftn1">[1]</a> A relative risk ratio of 1.0 would mean no difference in effect between the treatment types.</p>
<p><a href="#_ftnref2" name="_ftn2">[2]</a> A summary of efficacy rates across the range of current COVID vaccines can be found <a href="http://www.healthdata.org/covid/covid-19-vaccine-efficacy-summary">here</a>.</p>
<p><a href="#_ftnref3" name="_ftn3">[3]</a> One reason for the range is that the flu strain that is in circulation can differ from what was predicted when the annual flu vaccine was developed earlier in the year.</p>
<p>The post <a href="https://www.kddanalytics.com/covid-vaccine-efficacy-effectiveness/">Efficacy vs Effectiveness of the COVID Vaccines…&#8221;tomato, tomahto&#8221;?</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1939</post-id>	</item>
		<item>
		<title>How to Visualize Changing Recession Start Date Forecasts</title>
		<link>https://www.kddanalytics.com/visualize-revisions-recession-start-date-forecasts/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Sat, 05 Jan 2019 22:19:59 +0000</pubDate>
				<category><![CDATA[Data Visualization]]></category>
		<category><![CDATA[Forecasting]]></category>
		<category><![CDATA[Tableau]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[forecasting]]></category>
		<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">https://www.kddanalytics.com/?p=1515</guid>

					<description><![CDATA[<p>In case you missed it, we are in a recession. According to Intensity’s latest US recession start date forecast, there is a 50% probability of a recession starting sometime in the January to February 2019 period.  And a 97% probability of it starting sometime within the next 6 months. Their “point estimate” of a recession&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/visualize-revisions-recession-start-date-forecasts/">How to Visualize Changing Recession Start Date Forecasts</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In case you missed it, <strong>we are in a recession</strong>.</p>
<p>According to <a href="https://intensity.com/news/intensity-recession-forecast-january-3-2019" target="_blank" rel="noopener"><strong>Intensity’s latest US recession start date forecast</strong></a>, there is a 50% probability of a recession starting sometime in the January to February 2019 period.  And a 97% probability of it starting sometime within the next 6 months.</p>
<p>Their “<a href="https://en.wikipedia.org/wiki/Point_estimation" target="_blank" rel="noopener"><strong>point estimate</strong></a>” of a recession start is January 2019.</p>
<p><strong>Like, as in, right now!</strong></p>
<p>If true, it will take a while for the impacts to start showing up in the official government statistics.  But the stock market sell-off last quarter may be a harbinger of things to come.</p>
<p><a href="https://intensity.com/" target="_blank" rel="noopener"><strong>Intensity</strong></a>, an economics and data science firm based in San Diego, CA, developed and back-tested a machine learning prediction algorithm for its clients.  The firm started releasing a monthly forecast of the next US recession start date to the public starting in March 2018.</p>
<p>Over the course of the last 11 months, it has been interesting following the updates to their forecast as economic conditions changed.</p>
<p>Intuitively, one would expect the forecast to “settle down” as the expected start date drew nearer.</p>
<p>And it got me thinking about what the best way is to visualize these changing forecasts.</p>
<h3>Visualizing Forecast Updates Over Time</h3>
<p>The forecasted recession start date is not linear with time.  For example, in March 2018, the next recession was forecasted by Intensity to start in April 2019.  But in April 2018, the forecast was revised, and the recession was to start <strong>6 months earlier</strong> in October 2018.</p>
<p>Plotting the month of the forecast on the x-axis and the forecasted month of the recession start on the y-axis yields a “traditional time series” view as shown below.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-1532" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-forecast-shown-horizontally.png?resize=1024%2C727&#038;ssl=1" alt="Intensity recession forecast - shown horizontally" width="1024" height="727" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-forecast-shown-horizontally.png?resize=1024%2C727&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-forecast-shown-horizontally.png?resize=300%2C213&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-forecast-shown-horizontally.png?resize=768%2C545&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-forecast-shown-horizontally.png?w=1332&amp;ssl=1 1332w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>As time progresses from left to right, we can see the forecasted recession start date fluctuating up and down, settling on January 2019, the most recent forecasted start date.</p>
<p>However, another way to visualize this is to show the progression of time vertically, from bottom to top.  In this case the forecasted recession start date would fluctuate horizontally, left and right, as shown below.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-1528" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-Forecast-shown-vertically-1.png?resize=1024%2C734&#038;ssl=1" alt="Intensity Recession Forecast - shown vertically" width="1024" height="734" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-Forecast-shown-vertically-1.png?resize=1024%2C734&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-Forecast-shown-vertically-1.png?resize=300%2C215&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-Forecast-shown-vertically-1.png?resize=768%2C551&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Intensity-Forecast-shown-vertically-1.png?w=1325&amp;ssl=1 1325w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>I don’t know about you, but I find this second view more appealing.  Maybe it is the old economist in me, trained on the <a href="https://en.wikipedia.org/wiki/Phillips_curve" target="_blank" rel="noopener"><strong>Phillips Curve</strong></a> in graduate school.  But for me, the vertical, “up-down” orientation makes the variation in the forecasted recession start date “pop” more than in the horizontal, “left-to-right” view.</p>
<h3>So, Recession in 2019?</h3>
<p>It will be very interesting to see if Intensity sticks to its January 2019 point estimate.  Prior to the unexpectedly positive <a href="https://www.marketwatch.com/amp/story/guid/C82CF1F6-0F91-11E9-835D-C91F740D86E0" target="_blank" rel="noopener"><strong>December 2018 jobs report</strong></a><strong>,</strong> the consensus seemed to be a recession starting some time in 2019 or 2020.  For example, <a href="https://news.yahoo.com/gary-shilling-sees-66-chance-041710124.html" target="_blank" rel="noopener"><strong>Gary Shilling</strong></a> recently tossed his hat into the recession ring with a predicted 66% chance of a recession in 2019.</p>
<p>However, the positive jobs report apparently has many economists now <a href="https://www.washingtonpost.com/business/economy/us-jobs-data-boosts-wall-street-and-reassures-investors-about-economy/2019/01/04/b910ac92-105b-11e9-8938-5898adc28fa2_story.html?noredirect=on&amp;utm_term=.7685c12bcb54" target="_blank" rel="noopener"><strong>softening their stance</strong></a> on a recession this year.  And there is talk of policy makers being able to <strong><a href="https://www.csmonitor.com/Business/2019/0102/Recession-is-a-risk-in-2019.-But-maybe-one-that-policymakers-can-avoid" target="_blank" rel="noopener">sidestep a recession</a></strong>.</p>
<p>Only time will tell…so stay tuned!</p>
<h3>Plotting Ordered Times Series in Tableau</h3>
<p>By the way, these charts were made in <a href="https://www.tableau.com/" target="_blank" rel="noopener"><strong>Tableau</strong></a>.  And it was not as straightforward as flipping the axes to get the vertical view.  Tableau’s default inclination is to “connect the dots” from left to right when time is involved.</p>
<p>Fortunately, there is an easy way to get Tableau to connect the dots vertically.  This makes use of the <a href="https://onlinehelp.tableau.com/current/pro/desktop/en-us/viewparts_marks_markproperties.htm#PathProp" target="_blank" rel="noopener"><strong>Path property</strong></a> in the Marks card.  I simply added a field to my raw data that indicated the order of my data, which, of course was calendar order.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1518 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Path-Order.png?resize=638%2C362&#038;ssl=1" alt="Tableau data input - Path Order" width="638" height="362" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Path-Order.png?w=638&amp;ssl=1 638w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Path-Order.png?resize=300%2C170&amp;ssl=1 300w" sizes="auto, (max-width: 638px) 100vw, 638px" /></p>
<p>Then dropping this field on the Path property in the Marks card tells Tableau to connect the dots (or “Marks” in Tableau-speak) in this order.  With the date of the forecast on the vertical, y-axis, Tableau connects the dots from bottom to top.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="alignnone size-large wp-image-1529" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Tableau-Path-Order.png?resize=1024%2C809&#038;ssl=1" alt="Tableau Path Property on Marks Card" width="1024" height="809" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Tableau-Path-Order.png?resize=1024%2C809&amp;ssl=1 1024w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Tableau-Path-Order.png?resize=300%2C237&amp;ssl=1 300w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Tableau-Path-Order.png?resize=768%2C607&amp;ssl=1 768w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2019/01/Tableau-Path-Order.png?w=1255&amp;ssl=1 1255w" sizes="auto, (max-width: 1000px) 100vw, 1000px" /></p>
<p>Very slick!</p>
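<p>For readers who prefer scripting, the “add an order field” idea can be sketched in a few lines of Python (a hypothetical example, not part of the original workbook; the first two forecast revisions are taken from the post, the third row is illustrative): sort the rows by the date the forecast was made and assign a sequential path order.</p>

```python
# Hypothetical sketch of building a Path-order field: each row gets an
# integer giving its calendar position, which a tool like Tableau can
# then use (via the Path property) to connect the marks in order.
rows = [
    {"forecast_date": "2018-04", "forecast_start": "2018-10"},
    {"forecast_date": "2018-03", "forecast_start": "2019-04"},
    {"forecast_date": "2018-05", "forecast_start": "2019-01"},  # illustrative
]

# Sort by forecast date (YYYY-MM strings sort correctly) and number 1..n.
for order, row in enumerate(sorted(rows, key=lambda r: r["forecast_date"]), start=1):
    row["path_order"] = order

ordered = sorted(rows, key=lambda r: r["path_order"])
print([(r["forecast_date"], r["path_order"]) for r in ordered])
```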
<p>The post <a href="https://www.kddanalytics.com/visualize-revisions-recession-start-date-forecasts/">How to Visualize Changing Recession Start Date Forecasts</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1515</post-id>	</item>
		<item>
		<title>Practical Time Series Forecasting &#8211; Bounding Uncertainty</title>
		<link>https://www.kddanalytics.com/practical-time-series-forecasting-forecast-uncertainty/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Mon, 12 Feb 2018 03:46:22 +0000</pubDate>
				<category><![CDATA[Data Analytics Methods]]></category>
		<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Forecasting]]></category>
		<category><![CDATA[Time Series]]></category>
		<category><![CDATA[confidence interval]]></category>
		<category><![CDATA[forecast interval]]></category>
		<category><![CDATA[forecast uncertainty]]></category>
		<guid isPermaLink="false">http://www.kddanalytics.com/?p=1356</guid>

					<description><![CDATA[<p>“A good forecaster is not smarter than everyone else, he merely has his ignorance better organized.” ― Anonymous Predicting the future is an exercise in probability rather than certainty. As we have mentioned several times over the course of these articles, your forecast model will be wrong. It is just a matter of how useful&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/practical-time-series-forecasting-forecast-uncertainty/">Practical Time Series Forecasting &#8211; Bounding Uncertainty</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>“<em>A good forecaster is not smarter than everyone else, he merely has his ignorance better organized</em>.”<br />
― <strong>Anonymous</strong></p>
<p>Predicting the future is an exercise in probability rather than certainty. As we have mentioned several times over the course of these articles, <strong>your forecast model will be wrong</strong>.</p>
<p><strong> It is just a matter of how useful it might be.</strong></p>
<p>A time series model will <strong>forecast a path</strong> through the forecast horizon, a “point forecast.” But <strong>this path is just one of the paths</strong> your forecast can take based on your estimated model.</p>
<p>Providing a sense of the <strong>uncertainty surrounding your forecast</strong> is an essential part of your job as a forecaster.</p>
<h3>Forecast intervals</h3>
<p>The standard approach is to provide the “<a href="https://en.wikipedia.org/wiki/Prediction_interval" target="_blank" rel="noopener"><strong>forecast interval</strong></a>” for your forecast.</p>
<p>Typically, this is cast in terms of a 95% prediction interval. That is, 95 times out of 100, the actual value will fall within the specified range. (Note that there is a <a href="https://www.ma.utexas.edu/users/mks/statmistakes/CIvsPI.html" target="_blank" rel="noopener"><strong>difference between</strong></a> a &#8220;confidence&#8221; interval and a &#8220;forecast&#8221; interval.)</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1358 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Forecast-Interval.png?resize=603%2C371&#038;ssl=1" alt="Forecast Interval" width="603" height="371" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Forecast-Interval.png?w=603&amp;ssl=1 603w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Forecast-Interval.png?resize=300%2C185&amp;ssl=1 300w" sizes="auto, (max-width: 603px) 100vw, 603px" /></p>
<h3>Sources of forecast uncertainty</h3>
<p>There are at least <a href="http://www.eviews.com/help/helpintro.html#page/content%2FForecast-Forecast_Basics.html%23ww181365" target="_blank" rel="noopener"><strong>two sources</strong></a> of forecast uncertainty over the forecast horizon.</p>
<p>The <strong>first results from our ignorance of what the model’s error will be in the forecast horizon</strong>. So, we must rely on how well the model did in the recalibration sample (estimation + holdout) as an estimate.</p>
<p>The <strong>second source of uncertainty results from the model’s coefficients </strong>(or parameters)<strong> being estimates of their true values</strong>. As estimates, they have their own “confidence” interval.</p>
<p>As a result, <strong>the forecast interval can be quite large</strong> (as shown above). And, due to error compounding over time, the <strong>forecast interval widens</strong> the further into the forecast horizon you go.</p>
<p>In our example above, during the <strong>first month</strong> of the forecast horizon, the forecast interval is <strong>plus or minus 0.63%</strong> of the forecasted value. By <strong>month 6</strong>, this spread widens to <strong>plus or minus 2.95%</strong>.</p>
<p>Even accounting for forecast error and parameter uncertainty, these forecast intervals may still be <a href="https://robjhyndman.com/hyndsight/narrow-pi/" target="_blank" rel="noopener"><strong>too narrow</strong></a>.</p>
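<p>To see why the interval widens, here is a minimal sketch (assuming a simple AR(1) model with known parameters — not the model behind the chart above) of how the h-step-ahead forecast variance, and hence the 95% interval half-width, grows with the horizon:</p>

```python
import numpy as np

def ar1_interval_halfwidth(phi, sigma, h):
    """Half-width of the 95% forecast interval, h steps ahead, for an
    AR(1) model y_t = phi * y_{t-1} + e_t with error s.d. sigma.
    Forecast errors compound: variance = sigma^2 * sum(phi^(2i))."""
    var = sigma ** 2 * sum(phi ** (2 * i) for i in range(h))
    return 1.96 * np.sqrt(var)

# Illustrative values (not from the post): the interval is noticeably
# wider at month 6 than at month 1, mirroring the widening fan above.
print(ar1_interval_halfwidth(0.8, 1.0, 1))
print(ar1_interval_halfwidth(0.8, 1.0, 6))
```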
<h3>What about meta forecasts?</h3>
<p>In an <a href="https://www.kddanalytics.com/practical-time-series-forecasting-meta-models/" target="_blank" rel="noopener"><strong>earlier article</strong></a> we discussed <strong>combining forecasts into a meta forecast</strong>. The <strong>challenge</strong> in terms of a <strong>meta prediction interval</strong> is that it is <strong>not a simple matter to combine the prediction intervals of the constituents’ forecasts</strong>.</p>
<p><strong>One approach</strong> is to simply <strong>show the extreme upper and lower forecast paths</strong> along with the meta forecast path, which will lie somewhere between the two extremes.</p>
<p>And then to <strong>caution the consumer of your forecast</strong> that this is just to give a sense of the possible forecast range, which <strong>will likely be too narrow</strong> (since the upper and lower forecast will each have their own prediction interval).</p>
<h3>Probability-based assessment of forecast uncertainty</h3>
<p>Another approach is to <strong>couch your forecast uncertainty </strong><strong>in terms of a probability</strong>.</p>
<p>For example, based on your SALES forecast, <strong>what are the chances of hitting a certain level of sales by a certain date</strong>? If you are forecasting procurement needs for a warehouse, <strong>what is the chance of running out of inventory by a certain date</strong>? If you are a macroeconomist forecasting GDP, <strong>what are the chances of the economy falling into a recession by a certain date</strong>?</p>
<p><strong>Suppose</strong> you are tasked with forecasting daily SALES over the next year.</p>
<p><strong>Management has targeted a certain level of SALES and wants to know when that target will be hit.</strong> You can use the forecast uncertainty produced by your model to generate the following chart:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1361 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Forecast-Risk-Curve.png?resize=603%2C372&#038;ssl=1" alt="Forecast risk curve" width="603" height="372" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Forecast-Risk-Curve.png?w=603&amp;ssl=1 603w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Forecast-Risk-Curve.png?resize=300%2C185&amp;ssl=1 300w" sizes="auto, (max-width: 603px) 100vw, 603px" />The vertical axis is the chance of hitting the SALES target by a certain date (in this case, days into the next year). So, <strong>160 days into the year, there is a 10% chance of hitting the sales target.</strong></p>
<p><strong>By day 192</strong>, a month later, the <strong>chance has grown to 30%</strong>. And <strong>by day 218, there is a 50/50 chance</strong> the sales target will be reached.</p>
<p>Stating these chances in terms of odds may be an easier way to present this:</p>
<p><strong>By day 160</strong>, the odds against hitting the target would be <strong>9 to 1</strong>. By <strong>day 192</strong>, a little over <strong>2 to 1</strong>. And by <strong>day 218</strong>, <strong>1 to 1…a flip of the coin.</strong></p>
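<p>The probability-to-odds conversion behind these numbers is simple arithmetic (a hypothetical helper, not from the original post):</p>

```python
def odds_against(p):
    """Convert a probability p of an event into 'odds against',
    e.g. p = 0.10 gives 9.0, read as '9 to 1 against'."""
    return (1 - p) / p

print(odds_against(0.10))  # day 160: 9 to 1
print(odds_against(0.30))  # day 192: a little over 2 to 1
print(odds_against(0.50))  # day 218: 1 to 1, a coin flip
```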
<h3>Bottom line</h3>
<p>Uncertainty is a fact of life and your forecasts will be “wrong.”</p>
<p>But quantifying how wrong they can be will go a long way towards making them “useful.”</p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-introduction/" target="_blank" rel="noopener"><strong>Part 1 &#8211; Practical Time Series Forecasting &#8211; Introduction</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-basics/" target="_blank" rel="noopener"><strong>Part 2 &#8211; Practical Time Series Forecasting &#8211; Some Basics</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-useful-models/" target="_blank" rel="noopener"><strong>Part 3 &#8211; Practical Time Series Forecasting &#8211; Potentially Useful Models</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-data-science-taxonomy/" target="_blank" rel="noopener"><strong>Part 4 &#8211; Practical Time Series Forecasting &#8211; Data Science Taxonomy</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>Part 5 &#8211; Practical Time Series Forecasting &#8211; Know When to Hold &#8217;em</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-what-makes-a-useful-model/" target="_blank" rel="noopener"><strong>Part 6 &#8211; Practical Time Series Forecasting &#8211; What Makes a Model Useful?</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-deterministic-stochastic-trend/" target="_blank" rel="noopener"><strong>Part 7 &#8211; Practical Time Series Forecasting &#8211; To Difference or Not to Difference</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-times-series-forecasting-rolling-holdout-sample/" target="_blank" rel="noopener"><strong>Part 8 &#8211; Practical Time Series Forecasting &#8211; Know When to Roll &#8217;em</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-meta-models/" target="_blank" rel="noopener"><strong>Part 9 &#8211; Practical Time Series Forecasting &#8211; Meta Models</strong></a></p>
<p>The post <a href="https://www.kddanalytics.com/practical-time-series-forecasting-forecast-uncertainty/">Practical Time Series Forecasting &#8211; Bounding Uncertainty</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1356</post-id>	</item>
		<item>
		<title>Practical Time Series Forecasting – Meta Models</title>
		<link>https://www.kddanalytics.com/practical-time-series-forecasting-meta-models/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Mon, 05 Feb 2018 01:47:38 +0000</pubDate>
				<category><![CDATA[Data Analytics Methods]]></category>
		<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Forecasting]]></category>
		<category><![CDATA[Time Series]]></category>
		<category><![CDATA[forecast error]]></category>
		<category><![CDATA[MAPE]]></category>
		<category><![CDATA[meta forecast]]></category>
		<category><![CDATA[MPE]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[weighting]]></category>
		<guid isPermaLink="false">http://www.kddanalytics.com/?p=1331</guid>

					<description><![CDATA[<p>“There are two kinds of forecasters: those who don’t know, and those who don’t know they don’t know.” ― John Kenneth Galbraith After an extensive model building and vetting process, along the lines we previously discussed here and here, the practical forecaster may still be left with several strong performing models. These models perform similarly&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/practical-time-series-forecasting-meta-models/">Practical Time Series Forecasting – Meta Models</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>“<em>There are two kinds of forecasters: those who don’t know, and those who don’t know they don’t know.</em>”<br />
― <a href="https://en.wikipedia.org/wiki/John_Kenneth_Galbraith" target="_blank" rel="noopener"><strong>John Kenneth Galbraith</strong></a></p>
<p>After an extensive model building and vetting process, along the lines we previously discussed <strong><a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener">here</a></strong> and <a href="https://www.kddanalytics.com/practical-time-series-forecasting-rolling-holdout-sample-analysis/" target="_blank" rel="noopener"><strong>here</strong></a>, the practical forecaster may still be left with several strong performing models.</p>
<p>These models perform similarly in the holdout sample tests. They retain their statistical properties when recalibrated on the full historical sample. But they <strong>yield different forecast paths over the forecast horizon</strong>.</p>
<p>Any one of the models could be easily defended. But the <strong>fact that the models yield different forecasts should make the forecaster pause</strong>.</p>
<h3>An example</h3>
<p>Below is an example of 3 short-run monthly forecasts:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1334 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Example-of-Different-FC.png?resize=603%2C371&#038;ssl=1" alt="Examples of competing forecasts" width="603" height="371" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Example-of-Different-FC.png?w=603&amp;ssl=1 603w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Example-of-Different-FC.png?resize=300%2C185&amp;ssl=1 300w" sizes="auto, (max-width: 603px) 100vw, 603px" /></p>
<p>The 3 models perform similarly in the holdout sample. One of the models is a least squares model. The other 2 are ARIMA models.</p>
<p>One model produces a <strong>steeply declining forecast</strong>. Another a <strong>slightly declining forecast</strong>. The third model produces an <strong>increasing forecast</strong>.</p>
<p>What should the forecaster do?</p>
<h3>How can this happen?</h3>
<p>Models are just that – models. They are abstractions from reality. And <strong>no single model will “fit” the holdout sample perfectly</strong>.</p>
<p>Two <strong>models</strong>, especially <strong>of different types</strong> (e.g. least squares vs. ARIMA), could have very <strong>similar holdout sample performance but differ</strong> dramatically <strong>in their forecast</strong> over the forecast horizon.</p>
<p>The holdout sample <strong>MAPE</strong> (<a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>mean absolute percentage error</strong></a>) could be very similar for these models. But the <strong>MAPE is an average error across the holdout sample</strong>. And the models could have arrived at their MAPEs by <strong>focusing on different aspects of the time series in the holdout sample.</strong></p>
<p>Projecting these differences into the forecast horizon can result in very different forecasts.</p>
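<p>For concreteness, here is how the two holdout measures are computed (illustrative numbers, not from the example above): the absolute values in MAPE measure accuracy, while the signed errors in MPE measure bias.</p>

```python
# Sketch of holdout-sample MAPE and MPE on made-up actuals/forecasts.
actual   = [100.0, 110.0, 120.0, 130.0]
forecast = [ 98.0, 113.0, 118.0, 133.0]

# Percentage errors, relative to the actual values.
errs = [(a - f) / a for a, f in zip(actual, forecast)]

mape = 100 * sum(abs(e) for e in errs) / len(errs)  # accuracy (always >= 0)
mpe  = 100 * sum(errs) / len(errs)                  # bias (sign matters)
print(round(mape, 2), round(mpe, 2))
```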
<h3>Solutions</h3>
<p>When there is no clear “champion” model, one <strong>solution is to combine the forecasts into one</strong>. We call this a “<strong><a href="https://en.wikipedia.org/wiki/Metamodeling">meta</a></strong>” forecast.</p>
<p>There are several ways this can be accomplished.</p>
<h4>Checkpoint</h4>
<p><strong>But first</strong>, <strong>check</strong> to make sure the <strong>models</strong> to be combined are <strong>not “nested.”</strong> That is, <strong>one model is not a subset of another</strong>. If models are nested, there is usually no advantage to combining their forecasts into a meta forecast.</p>
<p>In fact, a <strong>meta forecast will more likely be superior the greater the differences between the constituent models</strong>.</p>
<p>A meta forecast based on a least squares model and an ARIMA model will likely yield a smaller forecast error than that associated with either of the two models. However, if the two models were both least squares models, the superiority of a meta forecast might be questionable (<a href="https://www.amazon.com/Forecasting-Business-Economics-Econometrics-Mathematical/dp/0122951816"><strong>Granger, 1989</strong></a>).</p>
<h4>Solution 1</h4>
<p>The simplest approach to arriving at a meta forecast is to <strong>simply average the forecasts</strong> of the individual models.</p>
<p>This essentially assumes that <strong>each model’s forecast is equally important in the meta forecast </strong>(i.e. receives equal weighting). This is a quick and uncomplicated way to generate a meta forecast.</p>
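<p>As a minimal sketch (with made-up numbers echoing the three candidate shapes), the equal-weight meta forecast is just the element-wise mean of the forecast paths:</p>

```python
import numpy as np

# Illustrative forecast paths for three candidate models.
forecasts = np.array([
    [100.0,  98.0,  95.0],  # model 1: steeply declining
    [100.0,  99.5,  99.0],  # model 2: slightly declining
    [100.0, 101.0, 102.0],  # model 3: increasing
])

# Simple average across models = equal weighting of each forecast.
meta = forecasts.mean(axis=0)
print(meta)
```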
<h4>Solution 2</h4>
<p>Another approach <strong>makes use</strong> of each model’s <strong>holdout sample performance measures of forecast accuracy and bias</strong>. A weighting for each model&#8217;s forecast can be calculated using each model’s <strong>MAPE</strong> and <strong>MPE</strong> (<a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>mean percentage error</strong></a>) relative to that of all the models combined.</p>
<p>The meta forecast would then be a <strong>weighted average</strong> of the individual model forecasts. Models with <strong>lower MAPE and MPE</strong> would receive <strong>higher weights and contribute more</strong> to the meta forecast.</p>
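<p>One simple way to turn holdout performance into weights (a sketch with assumed MAPE values; a fuller version might fold in MPE as well) is to weight each model by its inverse MAPE, normalized to sum to one:</p>

```python
# Illustrative holdout MAPEs (%) for three candidate models.
mapes = [2.0, 4.0, 8.0]

# Inverse-error weighting: lower MAPE -> higher weight.
inv = [1.0 / m for m in mapes]
weights = [v / sum(inv) for v in inv]
print(weights)
```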
<h4>Solution 3</h4>
<p>A third approach is to use <strong>regression</strong> to estimate the weights.</p>
<p>Using the holdout sample (or, if it is too small, the full sample), <strong>regress the actual value on the forecasted value from each model</strong>. The goal is to find a regression with <strong>no constant and all regression coefficients positive and statistically significant</strong>.</p>
<p>The regression <strong>coefficients should then sum very close to one</strong>. These <strong>coefficients then become the weights</strong> by which forecasts are combined into a meta forecast (see <a href="https://www.amazon.com/Business-Forecasting-ForecastX-Holton-Wilson/dp/0073373648/ref=sr_1_2?s=books&amp;ie=UTF8&amp;qid=1512008807&amp;sr=1-2&amp;keywords=wilson+keating+forecasting"><strong>Wilson and Keating</strong></a>).</p>
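<p>A sketch of the regression approach on synthetic data (hypothetical numbers; in practice you would use your own holdout actuals and forecasts): regress the actuals on the candidate forecasts with no constant, and take the fitted coefficients as the combination weights.</p>

```python
import numpy as np

# Synthetic example: actuals constructed so the true weights are known.
rng = np.random.default_rng(0)
f1 = rng.normal(100, 5, 50)      # forecasts from model 1
f2 = rng.normal(100, 5, 50)      # forecasts from model 2
actual = 0.6 * f1 + 0.4 * f2     # true combination: 60/40

# Least squares through the origin (no constant term).
X = np.column_stack([f1, f2])
weights, *_ = np.linalg.lstsq(X, actual, rcond=None)
print(weights)  # recovers approximately [0.6, 0.4]
```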
<h3>Back to our example</h3>
<p>The forecaster could go with candidate 3 since it &#8220;splits the difference.&#8221; However, the forecaster is still left with the task of defending why the other two equally plausible models were not chosen.</p>
<p>Alternatively, a meta forecast can be used. As an example, we created a <strong>simple average forecast</strong> across the 3 candidate models. As discussed above, this <strong>assumes an equal weighting across the 3 short-run forecasts</strong>. A more sophisticated approach would have been to estimate the weights using a regression approach.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1335 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Example-of-a-meta-forecast.png?resize=605%2C371&#038;ssl=1" alt="Example of a meta forecast" width="605" height="371" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Example-of-a-meta-forecast.png?w=605&amp;ssl=1 605w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Example-of-a-meta-forecast.png?resize=300%2C184&amp;ssl=1 300w" sizes="auto, (max-width: 605px) 100vw, 605px" /></p>
<p>Not surprisingly, the meta forecast is quite similar to the essentially flat forecast of candidate 3 (which lies almost halfway between candidate 1’s and candidate 2’s forecasts). <strong>But not all cases will be like this</strong>.</p>
<p>If a regression approach to estimating the weights was used, the meta forecast could be quite different from that of candidate 3.</p>
<p>Yes, the meta forecast will lie between the two forecast extremes. But the <strong>assumed or estimated weights will dictate where the meta forecast will lie</strong>.</p>
<h3>Bottom line</h3>
<p>Combining forecasts from equally strong models is intuitively appealing since <strong>each model has its strengths and weaknesses</strong>.</p>
<p><strong> Combining</strong> models’ forecasts in a <strong>complementary fashion</strong> should lead to <strong>more robust and accurate short-run forecasts</strong>.</p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-introduction/" target="_blank" rel="noopener"><strong>Part 1 &#8211; Practical Time Series Forecasting &#8211; Introduction</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-basics/" target="_blank" rel="noopener"><strong>Part 2 &#8211; Practical Time Series Forecasting &#8211; Some Basics</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-useful-models/" target="_blank" rel="noopener"><strong>Part 3 &#8211; Practical Time Series Forecasting &#8211; Potentially Useful Models</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-data-science-taxonomy/" target="_blank" rel="noopener"><strong>Part 4 &#8211; Practical Time Series Forecasting &#8211; Data Science Taxonomy</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>Part 5 &#8211; Practical Time Series Forecasting &#8211; Know When to Hold &#8217;em</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-what-makes-a-useful-model/" target="_blank" rel="noopener"><strong>Part 6 &#8211; Practical Time Series Forecasting &#8211; What Makes a Model Useful?</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-deterministic-stochastic-trend/" target="_blank" rel="noopener"><strong>Part 7 &#8211; Practical Time Series Forecasting &#8211; To Difference or Not to Difference</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-times-series-forecasting-rolling-holdout-sample/" target="_blank" rel="noopener"><strong>Part 8 &#8211; Practical Time Series Forecasting &#8211; Know When to Roll &#8217;em</strong></a></p>
<p>The post <a href="https://www.kddanalytics.com/practical-time-series-forecasting-meta-models/">Practical Time Series Forecasting – Meta Models</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1331</post-id>	</item>
		<item>
		<title>Practical Time Series Forecasting – Know When to Roll ‘em</title>
		<link>https://www.kddanalytics.com/practical-times-series-forecasting-rolling-holdout-sample/</link>
		
		<dc:creator><![CDATA[KDD]]></dc:creator>
		<pubDate>Mon, 29 Jan 2018 01:33:32 +0000</pubDate>
				<category><![CDATA[Data Analytics Methods]]></category>
		<category><![CDATA[Econometrics]]></category>
		<category><![CDATA[Forecasting]]></category>
		<category><![CDATA[Time Series]]></category>
		<category><![CDATA[forecast error]]></category>
		<category><![CDATA[holdout sample]]></category>
		<category><![CDATA[rolling analysis]]></category>
		<category><![CDATA[times series]]></category>
		<guid isPermaLink="false">http://www.kddanalytics.com/?p=1322</guid>

					<description><![CDATA[<p>“Prediction is very difficult, especially if it&#8217;s about the future.” ― Niels Bohr, physicist Holdout samples are a key component of estimating a “useful” forecasting model. Set aside data at least equal in length to your forecast horizon (“holdout sample”). Build your models on the remaining data (“modeling sample”). And compare the candidate models’ forecast&#8230;</p>
<p>The post <a href="https://www.kddanalytics.com/practical-times-series-forecasting-rolling-holdout-sample/">Practical Time Series Forecasting – Know When to Roll ‘em</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><strong>“</strong><em>Prediction is very difficult, especially if it&#8217;s about the future.</em><strong>”<br />
― <a href="https://en.wikipedia.org/wiki/Niels_Bohr" target="_blank" rel="noopener">Niels Bohr</a></strong>, physicist</p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>Holdout samples</strong></a> are a key component of estimating a “useful” forecasting model. <strong>Set aside data at least equal in length to your forecast horizon</strong> (“holdout sample”). Build your models on the remaining data (“modeling sample”). And <strong>compare the candidate models’ forecast performance over the holdout sample.</strong></p>
<p>At a minimum, a single holdout sample should be used.</p>
<p>But to get a <strong>better sense of a model’s future performance, consider using multiple holdout samples</strong>.</p>
<p>This <strong>guards against</strong> basing your model on a <strong>holdout sample</strong> that is <strong>unrepresentative</strong> of the overall characteristics of the time series.</p>
<p>One way to achieve this is to use<strong> “rolling” holdout samples</strong>.</p>
<h3>Rolling analysis</h3>
<p>A <a href="https://link.springer.com/chapter/10.1007%2F978-0-387-32348-0_9" target="_blank" rel="noopener"><strong>rolling analysis</strong></a> of a time series is generally used to test a model’s stability. That is, <strong>are a model’s parameters stable across time</strong> or do they change, especially in a systematic way?</p>
<p>This is important for a forecasting model. We <strong>don’t want</strong> a forecasting model whose <strong>parameters</strong> are <strong>changing during the forecast horizon in an unexpected (i.e. unmodeled) manner.</strong></p>
<p>Suppose our forecast horizon is 6 months.</p>
<p><strong> Under a single holdout sample</strong>, we would <strong>set aside the last 6 months of data as the holdout sample</strong>. Then, using the remaining data as the modeling sample, we would estimate models, forecast over the single holdout sample, and compare the models’ performance.</p>
<p>This will help narrow down the pool of candidate models.</p>
<h4>Rolling holdout samples</h4>
<p>But under a rolling holdout approach, also called &#8220;<a href="http://otexts.org/fpp2/accuracy.html" target="_blank" rel="noopener"><strong>time series cross-validation</strong></a>,&#8221;  <strong>we would set aside a longer sample of data</strong>, say, the last 12 months. Then:</p>
<p><strong>Step 1:</strong>  Estimate a model and forecast over the <strong>first</strong> 6 months of this 12-month period (&#8220;roll 1&#8221;);</p>
<p><strong>Step 2:</strong>  Then add one month to the tail end of the estimation sample, recalibrate the model, and forecast over the subsequent 6 months (“roll 2”);</p>
<p><strong>Step 3:</strong>  Then add another month to the estimation sample, recalibrate, and forecast over the subsequent 6 months (“roll 3”);</p>
<p><strong>Step 4:</strong>  Repeat until there are no more 6-month periods (&#8220;rolls&#8221;) remaining in the 12-month period.</p>
<p>So, <strong>in this example</strong>, we would have <strong>recalibrated our model 7 times</strong> (each with a modeling sample one month longer than the previous). And we would have <strong>made 7 forecasts over the rolling holdout periods</strong>.</p>
<p>The <strong>last &#8220;roll</strong>,&#8221; it turns out, <strong>is the same 6-month period</strong> we would have used <strong>under a single 6-month holdout sample case</strong>. So, we generate the stats for a standard single holdout sample during the course of this rolling holdout approach.</p>
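<p>The four steps above can be sketched as a small helper that generates the rolling train/holdout index splits. This is a minimal illustration under the example's assumptions (a 12-month evaluation window, a 6-month horizon, rolling forward one month per roll), not a reference implementation:</p>

```python
# A minimal sketch of the rolling-holdout ("time series cross-validation")
# scheme described above: a 12-observation evaluation window, 6-step-ahead
# forecasts, and an estimation sample that grows by one observation per roll.

def rolling_splits(n_obs, eval_window=12, horizon=6):
    """Yield (train_end, test_start, test_end) index triples for each roll.

    Indices are 0-based and end-exclusive, so series[:train_end] is the
    modeling sample and series[test_start:test_end] is that roll's holdout.
    """
    first_train_end = n_obs - eval_window
    n_rolls = eval_window - horizon + 1
    for roll in range(n_rolls):
        train_end = first_train_end + roll
        yield train_end, train_end, train_end + horizon

# With 60 monthly observations, a 12-month window and a 6-month horizon
# give 7 rolls, and the last roll is exactly the ordinary single 6-month
# holdout sample (the last 6 observations).
splits = list(rolling_splits(60))
assert len(splits) == 7
assert splits[-1] == (54, 54, 60)
```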
<p>If we are examining multiple candidate models, this process can generate a lot of data. Below is an example of the rolling forecasts for one model.</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1325 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Rolling-Holdout-Samples.png?resize=561%2C547&#038;ssl=1" alt="Rolling Holdout Samples" width="561" height="547" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Rolling-Holdout-Samples.png?w=561&amp;ssl=1 561w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Rolling-Holdout-Samples.png?resize=300%2C293&amp;ssl=1 300w" sizes="auto, (max-width: 561px) 100vw, 561px" /></p>
<h3>Summary roll statistics</h3>
<p>We could generate a similar chart for every model we are testing. But it is <strong>easier to work with measures of forecast accuracy and bias</strong>, such as <a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>MAPE</strong></a> and <a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>MPE</strong></a>.</p>
<p>For each roll forecast, we can calculate the MAPE and MPE and observe how they change across the rolling forecasts.</p>
<p>Are the MAPE and MPE constant? Fluctuate with no apparent trend? Or exhibit some systematic trend?</p>
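<p>A minimal sketch of those per-roll statistics, using one common convention for MAPE and MPE (percentage errors measured relative to actuals; the data below are illustrative only):</p>

```python
# A minimal sketch of per-roll accuracy (MAPE) and bias (MPE) statistics.
# The actual/forecast values below are illustrative, not from the article.

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (always non-negative)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def mpe(actual, forecast):
    """Mean percentage error, in percent (signed). Under this convention,
    positive values indicate under-forecasting on average."""
    return 100.0 * sum((a - f) / a for a, f in zip(actual, forecast)) / len(actual)

# One hypothetical 6-month roll:
actual   = [100.0, 102.0, 101.0, 105.0, 104.0, 106.0]
forecast = [ 98.0, 103.0, 100.0, 107.0, 103.0, 104.0]

print(round(mape(actual, forecast), 2))  # average absolute % miss for this roll
print(round(mpe(actual, forecast), 2))   # signed % miss; sign reveals any bias
```

Computing these for each roll, and then looking at their trend across rolls, gives the stability picture described next.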
<p>Doing this for every candidate model we are testing generates charts like this which can quickly show any areas of concern:</p>
<p><img data-recalc-dims="1" loading="lazy" decoding="async" class="size-full wp-image-1326 aligncenter" src="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Rolling-Holdout-Samples-MAPE.png?resize=604%2C370&#038;ssl=1" alt="" width="604" height="370" srcset="https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Rolling-Holdout-Samples-MAPE.png?w=604&amp;ssl=1 604w, https://i0.wp.com/www.kddanalytics.com/wp-content/uploads/2017/12/Rolling-Holdout-Samples-MAPE.png?resize=300%2C184&amp;ssl=1 300w" sizes="auto, (max-width: 604px) 100vw, 604px" /></p>
<p>In this example, candidate models 18 and 15 may be worth further inspection since their MAPEs are much higher than the rest in a recent roll period (roll 6).</p>
<h3>What else makes a model useful?</h3>
<p>So, with respect to the <strong>guidelines</strong> for whittling down a pool of candidate models we listed in an <strong><a href="https://www.kddanalytics.com/practical-time-series-forecasting-what-makes-a-useful-model/" target="_blank" rel="noopener">earlier article</a></strong>, we can add the following from a rolling holdout analysis:</p>
<p><strong>Stability</strong> – The model’s parameters should retain their statistical significance and not vary too much across the rolling periods; and the model&#8217;s residuals should remain &#8220;<strong>white noise</strong>&#8221; across the rolls;</p>
<p><strong>Consistency of Performance</strong> – The model’s forecast accuracy and bias should not exhibit any strong trends, especially trends in the “wrong” direction (i.e. getting progressively worse) as the rolls approach the most recent period.</p>
<p><strong>Strong Rolling Holdout Sample Performance</strong> – The model’s forecast accuracy, <strong>averaged across all the rolls</strong>, should be high and its bias low. That is, <strong>the average MAPE should be low</strong> and <strong>the average MPE should be close to zero</strong>.</p>
<h3>Benefits of Rolling</h3>
<p>The primary benefit of a rolling analysis is that we get to see <strong>how a model performs</strong> forecast-wise <strong>over multiple time spans</strong> equal in length to our forecast horizon, <strong>instead of relying on performance in just one holdout sample</strong>.</p>
<p>A rolling analysis also <strong>addresses the issue of a short holdout sample</strong> (e.g. short forecast horizon) <strong>possibly not being representative of the general character of the time series</strong>.</p>
<p>In addition, a rolling analysis can be used as a check for the “best” model chosen using a single holdout sample. That is, would you pick the same model using the rolling holdout approach? If not, why?</p>
<p>In sum, <strong>a model that is persistently better at holdout sample forecasting over a longer time frame is likely to be more robust.</strong></p>
<p>So, let ‘em roll!</p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-introduction/" target="_blank" rel="noopener"><strong>Part 1 &#8211; Practical Time Series Forecasting &#8211; Introduction</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-basics/" target="_blank" rel="noopener"><strong>Part 2 &#8211; Practical Time Series Forecasting &#8211; Some Basics</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-useful-models/" target="_blank" rel="noopener"><strong>Part 3 &#8211; Practical Time Series Forecasting &#8211; Potentially Useful Models</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-data-science-taxonomy/" target="_blank" rel="noopener"><strong>Part 4 &#8211; Practical Time Series Forecasting &#8211; Data Science Taxonomy</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-holdout-sample/" target="_blank" rel="noopener"><strong>Part 5 &#8211; Practical Time Series Forecasting &#8211; Know When to Hold &#8217;em</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-what-makes-a-useful-model/" target="_blank" rel="noopener"><strong>Part 6 &#8211; Practical Time Series Forecasting &#8211; What Makes a Model Useful?</strong></a></p>
<p><a href="https://www.kddanalytics.com/practical-time-series-forecasting-deterministic-stochastic-trend/" target="_blank" rel="noopener"><strong>Part 7 &#8211; Practical Time Series Forecasting &#8211; To Difference or Not to Difference</strong></a></p>
<p>The post <a href="https://www.kddanalytics.com/practical-times-series-forecasting-rolling-holdout-sample/">Practical Time Series Forecasting – Know When to Roll ‘em</a> appeared first on <a href="https://www.kddanalytics.com">KDD Analytics</a>.</p>
]]></content:encoded>
					
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">1322</post-id>	</item>
	</channel>
</rss>
