### Using DAX for predictions with linear regression and correlation

DAX is a powerful programming language used by Excel and Power BI to manipulate data to gain the most insights. I have been learning both of these software's in extreme detail for over a year now and I have found that one MAJOR drawback is the lack of ability to make predictions using linear regression and Pearson's correlation coefficient. To obtain these statistics, people generally use R, SAS, or some other powerful statistical software but NOT DAX. Inspired by Rob Collie's blog "Movers and Fakers: Spotting Outlier Performance in Power Pivot / Power BI", I realized that, with a little statistics, regression is not a drawback of the software but a limitation on the user's knowledge of statistics. This post is dedicated to teaching Excel and Power BI users how to utilize DAX in order to achieve optimal, accurate predictions and <g class="gr_ gr_89 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" id="89" data-gr-id="89">data driven</g> decisions based <g class="gr_ gr_107 gr-alert gr_gramm gr_inline_cards gr_run_anim Grammar multiReplace" id="107" data-gr-id="107">off</g> the statistical findings within a dataset.

Watch the video here- https://youtu.be/yDyidzBaFuQ

Before we get to the DAX, let's update our statistics knowledge. Simple linear regression is used to predict the values of one variable (the criterion variable, Y) based on the values of another variable (the predictor variable, X) using the relationship between the two variables. It is referred to as "simple" linear regression because we only have one predictor variable for the criterion variable, but of course, it is possible to have many predictor variables and the more predictor variables, the more variance is explained in Y by X. The strength of the relationship between X and Y is measured by Pearson's correlation coefficient r, or for more accurate predictions- the coefficient of determination, r squared. If r squared is -1, there is a strong negative relationship between these two variables. If r squared is 0, there is absolutely no relation between these two variables. If r squared is 1, then there is a perfect relationship between these variables and there is no variance that exists between these variables (which would be impossible unless both variables are in fact, identical). There is much, MUCH more to be said about these and other statistics but I will try to keep things as short and simple as possible. Like in most math/statistics classes, I will show you the most computationally extensive way of achieving results first, then the easier way, so if statistics <g class="gr_ gr_124 gr-alert gr_gramm gr_inline_cards gr_run_anim Grammar multiReplace" id="124" data-gr-id="124">doesn't</g> excite you like it does me, skip to the end of this post

Let's take a look at the data we will be using for this example (shown below). This is fake data I created in Excel with three columns: Customer ID, Age (X), and Sales (Y). The ultimate goal here is to predict the sales amount, Y, based on the age, X, of customers that bought a hypothetical product.

The formula for Pearson's correlation coefficient is as follows (where n is the number of observations):

http://powerbipro.com/wp-content/uploads/2016/06/Pearsons-r.png 343w" sizes="(max-width: 300px) 100vw, 300px" />

It looks like we will need to create a few calculated columns and measures to make this formula work. The first step is loading our data in Power BI Desktop. After launching Power BI Desktop, we select the "Get Data" icon, then choose Excel and navigate to the location in which the data is stored, then load the data.

http://powerbipro.com/wp-content/uploads/2016/06/reggetdata.png 416w" sizes="(max-width: 243px) 100vw, 243px" />

The formula for Pearson's correlation coefficient needs information from:

As a side note, feel free to copy/paste my formulas and use them on your own data and substitute the names of my columns for yours. We create these columns in Power BI Desktop using the following formulas:

XY = [Age (X)]*[Sales (Y)]

Xsquared = [Age (X)]*[Age (X)]

<g class="gr_ gr_90 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" id="90" data-gr-id="90">Ysquared</g> = [Sales (Y)]*[Sales (Y)]

The result should look like the screenshot below.

http://powerbipro.com/wp-content/uploads/2016/06/reg4additional.png 507w" sizes="(max-width: 300px) 100vw, 300px" />

Now we need to create some measures to show the sum <g class="gr_ gr_86 gr-alert gr_gramm gr_inline_cards gr_run_anim Punctuation only-del replaceWithoutSep" id="86" data-gr-id="86">of:</g> X, Xsquared, Y, <g class="gr_ gr_85 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" id="85" data-gr-id="85">Ysquared</g>, and XY using the formulas below.

Xsum = SUM(DemoTable[Age (X)])

Xsquaredsum = SUM(DemoTable[Xsquared])

<g class="gr_ gr_93 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" id="93" data-gr-id="93">Ysum</g> = SUM(DemoTable[Sales (Y)])

Ysquaredsum = SUM(DemoTable[Ysquared])

XYsum = SUM(DemoTable[XY])

The result of our calculations is shown below:

http://powerbipro.com/wp-content/uploads/2016/06/reg6.png 222w" sizes="(max-width: 210px) 100vw, 210px" />

Now we can create Pearson's correlation coefficient to see if our data has a strong enough relationship for prediction. Remember, we are trying to achieve a result for r that is close to 1 where:

http://powerbipro.com/wp-content/uploads/2016/06/Pearsons-r.png 343w" sizes="(max-width: 300px) 100vw, 300px" />

I split up the formula for r in 4 different parts to make computations easier to read and follow.

r numerator = 10*[XYsum]-[Xsum]*[Ysum]

r denominator 1 = (10*[Xsquaredsum]-[Xsum]^2)*(10*[Ysquaredsum]-[Ysum]^2)

r denominator 2 = SQRT([r denominator 1])

r = DIVIDE([r numerator],[r denominator 2])

The correlation coefficient for our data is:

We simply square this number to achieve the coefficient of determination, r squared:

<g class="gr_ gr_112 gr-alert gr_gramm gr_inline_cards gr_run_anim Grammar multiReplace" id="112" data-gr-id="112"><g class="gr_ gr_94 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del" id="94" data-gr-id="94">A r</g></g> squared value of 0.8516 is a great indicator that X and Y have a strong linear relationship. Now <g class="gr_ gr_113 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling multiReplace" id="113" data-gr-id="113">lets</g> find the equation of the linear regression line that we will use to make predictions with. The formula for a line is usually in the format: Y = <g class="gr_ gr_95 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" id="95" data-gr-id="95">mx</g> + b (where m is the slope of the line, X is the predictor variable and b is the <g class="gr_ gr_96 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" id="96" data-gr-id="96">Y intercept</g>). The formula for finding m and b is as follows:

The DAX measures I used to find the equation of the line is as follows:

Slope = (10*[XYsum] – [Xsum]*[Ysum])/(10*[Xsquaredsum] – [Xsum]^2)

<g class="gr_ gr_97 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del multiReplace" id="97" data-gr-id="97">Y Intercept</g> = DIVIDE(([Ysum]*[Xsquaredsum] – [Xsum]*[XYsum]),(10*[Xsquaredsum] – [Xsum]^2))

Putting the results of these two measures together produces the formula for the line:

<g class="gr_ gr_98 gr-alert gr_spell gr_inline_cards gr_run_anim ContextualSpelling ins-del" id="98" data-gr-id="98">FINALLY</g>, we are able to start making predictions about the sales amount, Y, based on the values of age, X.

Let's see how much sales <g class="gr_ gr_100 gr-alert gr_gramm gr_inline_cards gr_run_anim Grammar multiReplace" id="100" data-gr-id="100">is</g> predicted for the ages 10, 30, and 50. We create the last measures needed for this analysis by substituting these ages into the formula of our regression line.

10 Sales = 0.9675*10 – 5.0895

30 Sales = 0.9675*30 – 5.0895

50 Sales = 0.9675*50 – 5.0895

The results of these measures are as follows:

We can see that as age increases, the sales amount increases so if we were advertising our hypothetical product, we should target individuals around 50 years old.

Neat huh?

You can also add trendlines to scatterplots in Power BI Desktop.

------------------------------

Pablo Moreno

Business Analyst Snr Advisor

Quest

Panama

69444206

------------------------------