Autobox Blog

Thoughts, ideas and detailed information on Forecasting.

Blog posts tagged in arimax level shift

Let's take a look at Microsoft's Azure platform, where they offer machine learning. I am not very impressed. Well, I should state that it's not really a Microsoft product, as they are just using an R package. There is no learning here in how the models are actually built. It is fitting, not intelligent modeling. Not machine learning.

The assumption behind any kind of modeling/forecasting is that the residuals are random, with a constant mean and variance. Many aren't aware of this unless they have taken a course in time series.
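As an illustration of that assumption (this is my own minimal sketch, not Azure's or Autobox's actual diagnostic), one quick sanity check is whether the residual mean stays constant over time: compare the first-half and second-half means against the overall spread.

```python
# Minimal residual sanity check: residuals from a well-specified model
# should have a roughly constant mean over time. A large gap between the
# first-half and second-half means hints at an unmodeled level shift.
# Illustrative sketch only; not any product's actual test.

def half_split_check(residuals, tol=0.5):
    """Return True if the two halves have similar means relative to the
    overall standard deviation; False suggests a possible level shift."""
    n = len(residuals)
    first, second = residuals[: n // 2], residuals[n // 2 :]
    mean = lambda xs: sum(xs) / len(xs)
    m = mean(residuals)
    sd = (sum((x - m) ** 2 for x in residuals) / (n - 1)) ** 0.5
    return abs(mean(first) - mean(second)) < tol * sd

# Residuals that sit above zero early and below zero late fail the check:
print(half_split_check([0.3, 0.25, 0.35, 0.2, -0.3, -0.25, -0.2, -0.35]))  # False
print(half_split_check([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05]))  # True
```

Real diagnostics go further (Ljung-Box on autocorrelations, tests for variance changes), but a split-mean comparison already catches the kind of residual pattern discussed below.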

Azure uses the R function auto.arima (from the forecast package) to do its forecasting. Auto.arima doesn't look for outliers, level shifts, or changes in trend, seasonality, parameters, or variance.

Here is the monthly data used (33 observations): 3.479, 3.68, 3.832, 3.941, 3.797, 3.586, 3.508, 3.731, 3.915, 3.844, 3.634, 3.549, 3.557, 3.785, 3.782, 3.601, 3.544, 3.556, 3.65, 3.709, 3.682, 3.511, 3.429, 3.51, 3.523, 3.525, 3.626, 3.695, 3.711, 3.711, 3.693, 3.571, 3.509

It is important to note that when presenting examples, many will choose a "good example" so the results can show off a good product. This data set is "safe", as it is on the easier side to model and forecast, but we need to delve into the details that distinguish real "machine learning" from fitting approaches. It's also worth noting that the data looks like it has been scaled down from much larger values. Alternatively, if the data isn't scaled and the three decimal places are real, then you are also looking for extreme accuracy in your forecast. The point I am going to make is that there is a small difference between the actual forecasts, but the lower level that Autobox delivers makes more sense, and Autobox delivers residuals that are more random. The key question is "is it robust?", which is exactly what Box and Jenkins stressed when they coined the term "robustness".

Here is the model produced by auto.arima. It is not too different from Autobox's, except for one major item, which we will discuss.

The residuals from the model are not random. This is a "red flag": they clearly show the first half of the data above zero and the second half below zero, signaling a "level shift" that is missing from the model.

Now, you could argue that there is an R outlier package with some buzz about it, called "tsoutliers", that might help. If you run this series through tsoutliers, a SPURIOUS Temporary Change (TC), up for a bit and then back to the same level, is identified at period 4, and another bad call, an Additive Outlier (AO), at period 13. It doesn't identify the level shift down, and it made two bad calls, so that is "0 for 3". Periods 22 to 33 are at a new, lower level; small but significant. I wonder if MSFT chose not to test the tsoutliers package here.

 

Autobox's model is just about the same, but it includes a level shift down beginning at period 11, with a magnitude of .107.
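You can sanity-check that level shift directly from the data posted above: compare the mean level before and after period 11. The raw gap won't match Autobox's .107 exactly, since the model estimates the shift jointly with the AR terms, but it lands in the same ballpark.

```python
# The 33 monthly observations from the post.
y = [3.479, 3.68, 3.832, 3.941, 3.797, 3.586, 3.508, 3.731, 3.915, 3.844,
     3.634, 3.549, 3.557, 3.785, 3.782, 3.601, 3.544, 3.556, 3.65, 3.709,
     3.682, 3.511, 3.429, 3.51, 3.523, 3.525, 3.626, 3.695, 3.711, 3.711,
     3.693, 3.571, 3.509]

before = y[:10]   # periods 1-10 (1-indexed), before the shift
after = y[10:]    # periods 11-33, after the shift down

# Raw difference in mean level across the break point.
shift = sum(before) / len(before) - sum(after) / len(after)
print(round(shift, 3))  # ~0.12, consistent with Autobox's jointly estimated .107
```

A difference of roughly 0.12 on a series that only moves between 3.4 and 3.9 is exactly the kind of "small but significant" shift a pure fitting approach leaves sitting in the residuals.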

Y(T) =  3.7258
       +[X1(T)][(-  .107)]                     :LEVEL SHIFT    1/ 11
       +[(1-  .864B** 1+  .728B** 2)]**-1  [A(T)]

Here are both forecasts. The gap between the green and the red lines is what you pay for.



Note that the Autobox upper confidence limits are much lower in level.

 

Autobox's residuals are random.

ARIMA(X), otherwise known as Transfer Function, models aren't easy to build, especially with outliers. A new book, Data Quality for Analytics Using SAS by Gerhard Svolba of SAS, shows this to be true. Click on the link and you will see the graph and the explanation of which outliers were identified.


I am going to make this post short and to the point.

The January 2007 value is an outlier and should have been flagged as one. The author tries to ignore it, but we do not.

December 2006, January 2008, November 2008, and December 2008 are also missed; they are clear outliers.

I will also point out that the data seems to be trending up while the forecast is flat, but we don't know the future values of the causals used, so it's tough to give a complete view here.

If you have the book and perhaps the data, post it here or send it to us and we will gladly analyze it or any data!

Follow up.....

We downloaded the data and SAS's Universal Viewer. The four data sets they let you download contain only transaction-level data, and they don't overlap the time frame of the example. So, if the data is not listed in the book, the only way to get it would be to contact the author himself. Here is the author's contact page if anyone wants to do that.

http://support.sas.com/publishing/bbu/companion_site/info.html

15 Questions surrounding outliers and time series forecasting/modeling

 

Does your current forecasting process automatically adjust for outliers? (correct answer Yes)

Do you make a separate run for certain problems that SHOULD NOT get adjusted for outliers as the outliers are in fact real and shouldn't be adjusted?  (correct answer Yes)

Do you know what standard-deviation threshold is used to identify an outlier?  (correct answer "who cares": you shouldn't have to tell the system)

Who knows that the standard-deviation calculation is itself skewed by the outlier?  (correct answer "who cares": you shouldn't have to tell the system)

Does the system ask you how many times it should "iterate" when removing outliers? How many times do you "iterate"? (correct answer "who cares": you shouldn't have to tell the system)

Does the system allow you to convert outliers to causals and flag future values when the event will happen again?  (correct answer Yes)

Does the system identify inliers? (e.g. 1,9,1,9,1,9,1,5)  (correct answer Yes)

Does the system recognize the difference between an outlier and a seasonal pulse?  (correct answer Yes) (e.g. 1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,etc.)

Does the system recognize the difference between an outlier and a level shift?  (correct answer Yes) (e.g. 0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,etc.)

Does the system recognize the difference between an outlier and a change in the trend?  (correct answer Yes)  (e.g. 0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,etc.)

Does the system allow you to force the outlier in the most recent period to be a "level shift" or a "seasonal pulse"?  (correct answer Yes)

Does the system report a file adjusted for outliers for pure data cleansing purposes?  (correct answer Yes)

Does the system adjust for outliers in a time series regression (i.e. ARIMAX/Transfer Function)? (correct answer Yes)

Who tries to find the assignable cause of why the outliers existed?   (correct answer I do)

Who then provides causals to explain the outliers to the system?   (correct answer I do)
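The pulse, seasonal-pulse, level-shift, and trend-change patterns spelled out in the questions above are the standard intervention regressors added to the model's right-hand side. A minimal sketch of their shapes (the helper names here are mine, illustrative only, not any product's API):

```python
# Intervention-series shapes from the questions above, for a series of
# length n with the event at 0-based period t. Helper names are
# illustrative, not a product API.

def pulse(n, t):
    """One-time outlier: 1 at period t, 0 elsewhere."""
    return [1 if i == t else 0 for i in range(n)]

def seasonal_pulse(n, t, period=12):
    """A pulse that repeats every `period` observations starting at t."""
    return [1 if i >= t and (i - t) % period == 0 else 0 for i in range(n)]

def level_shift(n, t):
    """0 before t, 1 from t onward: a permanent change in level."""
    return [0 if i < t else 1 for i in range(n)]

def trend_change(n, t):
    """0 before t, then 1, 2, 3, ...: a change in slope."""
    return [max(0, i - t + 1) for i in range(n)]

print(pulse(6, 2))            # [0, 0, 1, 0, 0, 0]
print(seasonal_pulse(26, 0))  # a 1 every 12th period, as in the question above
print(level_shift(8, 3))      # [0, 0, 0, 1, 1, 1, 1, 1]
print(trend_change(8, 3))     # [0, 0, 0, 1, 2, 3, 4, 5]
```

Telling these shapes apart matters because they imply very different forecasts: a pulse washes out of the forecast, a seasonal pulse recurs every cycle, a level shift moves every future value, and a trend change compounds.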
