The Impact of Machine Learning in Data Analysis
A comprehensive exploration of the strengths and weaknesses of traditional regression methods versus pure prediction algorithms, with a focus on the critical question of when to apply each approach. Insights from recent studies, competitions, and conferences highlight the evolving landscape in data analysis and forecasting methodologies, showcasing the benefits of integrating statistical models with machine learning techniques for enhanced accuracy.
Machine Learning is Better, But When? Keith Ord, Professor Emeritus of Business Statistics
When? A recent study by Efron (2020) provides a detailed assessment of the strengths and weaknesses of traditional regression-type methods and pure prediction algorithms. Efron states: "When they are [his italics] suitable, the pure prediction methods can be stunningly successful." The key question is: when?
Outline
- M5 Uncertainty Competition: overview
- Issues with data analysis
- The weight of numbers
- Are the criteria used in the study appropriate?
- Case study using the Box-Jenkins airline data
- What are the take-home lessons?
M5 Uncertainty Competition [see Makridakis, Spiliotis and Assimakopoulos (2022b), the report on the M5 Uncertainty Competition, referred to here as MSA22b]
- Data refer to day-by-day product sales by Walmart.
- The data are hierarchical by state, store, category, department and product. The most granular levels (numbered 10-12) refer to Product, Product*State and Product*Store.
- 42,840 series relating to 3,049 products; all but 154 series are in Levels 10-12.
- Additional information was available on holidays and prices.
- 1,913 days for model development and estimation; 28 days of forecasts (1 to 28 days ahead) generated for evaluation.
- 892 entries in the Kaggle competition.
- 6 statistical benchmarks: ARIMA and exponential smoothing methods, with no consideration of holidays or price data, and no combinations of benchmark methods.
Published Conclusions
"The most important finding of the M4 competition was that all of the top-performing methods, in terms of both PFs (Point Forecasts) and PIs (Prediction Intervals), were combinations of mostly statistical models, with such combinations being more accurate numerically than either pure statistical or pure ML methods." (Makridakis, Spiliotis and Assimakopoulos (2020), report on the M4 Competition)
Two principal conclusions in the M5 Competition (Makridakis, Spiliotis and Assimakopoulos (2022a), report on the M5 Accuracy Competition):
- "The superior performance of relatively simple ML (machine learning) methods as well as the significantly worse performance of statistical methods, which did not make it to the top ranks."
- "The [substantial] improvement [...] of the winning method over the most accurate statistical benchmark."
WHAT CHANGED?
Data Analysis
- Many of the series extend to nearly 2,000 observations; ML methods typically require long series for fitting, whereas statistical methods do not. The conclusions therefore do not extend to other contexts with short series.
- No combinations of statistical methods were used, despite the recommendations from M4.
- No preliminary data analysis was reported. Given the nature of retail sales data, it is reasonable to surmise that many of the series (particularly at the finer gradations of levels 10-12) would be stationary and dominated by the weekly cycle, along with holiday effects and special promotions (not considered in the statistical benchmarks).
- Sales at the product level may be intermittent, but no statistical benchmarks for intermittent series were used.
Data Analysis: Conclusions
The net effect of these design issues is that the statistical benchmarks were not able to bring their A-game to the competition.
Conclusion: ML methods are a useful way of incorporating additional data; standard time series packages need to do more to incorporate explanatory variables.
Weight of Numbers, 1
Consider the performance of a star athlete relative to a group of lesser performers, e.g., in a tennis tournament. Does the star always win? NO. Both the margin of superiority (measured as the number of standard deviations for a normal distribution) and the number of other competitors, N, matter.
[Figure: P(Serena wins) plotted against the star's advantage in SDs (0 to 3), for N = 1 and N = 127 competitors.]
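A minimal Monte Carlo sketch of this effect (mine, not from the talk): all competitors' scores are standard normal, the star's mean is shifted up by the stated advantage, and we estimate the chance the star tops the field for N = 1 and N = 127 rivals.

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_star_wins(advantage_sd, n_rivals, n_sims=20_000):
    """Estimate P(star beats all rivals) when every score is N(0, 1)
    and the star's mean is shifted up by advantage_sd."""
    star = rng.normal(advantage_sd, 1.0, size=n_sims)
    rivals = rng.normal(0.0, 1.0, size=(n_sims, n_rivals))
    return float((star[:, None] > rivals).all(axis=1).mean())

for adv in (0.5, 1.0, 2.0, 3.0):
    print(f"advantage {adv} SD: N=1 -> {prob_star_wins(adv, 1):.3f}, "
          f"N=127 -> {prob_star_wins(adv, 127):.3f}")
```

Even a three-SD star loses noticeably often against a field of 127, which is the point of the figure.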
Weight of Numbers, 2
How can we allow for the weight-of-numbers effect in the present case? We have 892 ML methods and 6 benchmarks. If we select 6 methods at random out of 898, what is the expected rank of the best benchmark? The expected rank of the best benchmark is 126 (evaluated by simulation; a sketch follows).
MSA22b give the rank of the best benchmark for the overall WSPL score as 203 (202 designed methods beat the best-performing benchmark). From MSA22b, Tables 2 and 5, the rank of the best benchmark (ARIMA) for levels 10-12 is 41, considerably better than the expected 126.
CONCLUSION: The best ML methods are superior to the leading benchmark in an overall sense, but the margin of superiority would seem to be less than suggested, and it does not apply at levels 10-12.
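A quick check of that expected rank, under the stated assumption that the 6 benchmarks' ranks behave like a random 6-subset of 1..898. The closed form for the expected minimum of such a sample gives about 128, in the same ballpark as the 126 quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 898, 6          # 892 ML entries + 6 statistical benchmarks
n_sims = 100_000

# Expected rank of the best of K methods drawn at random from ranks 1..N.
best = np.array([rng.choice(N, size=K, replace=False).min() + 1
                 for _ in range(n_sims)])
print("simulated mean best rank:", best.mean())               # ~128
print("closed form (N + 1) / (K + 1):", (N + 1) / (K + 1))    # 128.43
```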
Comparison of PIs (as in MSA22b)
Measure the trade-off between width and coverage: a penalty for width plus a penalty for missing the observation. Here $y_t$ is the observation, $F_t$ the point forecast, $U_t$ and $L_t$ the upper and lower PI limits, $(1-\alpha)$ the coverage probability, and the scores average over the $h$ holdout periods (U = upper tail, L = lower tail, MSIS = Mean Scaled Interval Score):

$$\mathrm{UMSIS} = \frac{1}{h}\sum_{t=n+1}^{n+h}\Big[(U_t - F_t) + \tfrac{2}{\alpha}(y_t - U_t)\,\mathbf{1}\{y_t > U_t\}\Big],$$

$$\mathrm{LMSIS} = \frac{1}{h}\sum_{t=n+1}^{n+h}\Big[(F_t - L_t) + \tfrac{2}{\alpha}(L_t - y_t)\,\mathbf{1}\{y_t < L_t\}\Big],$$

$$\mathrm{MSIS} = \mathrm{UMSIS} + \mathrm{LMSIS}.$$
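For concreteness, a minimal NumPy sketch of these two pieces. The function name is mine, and the values returned are unscaled averages; dividing by the usual MSIS scaling factor (the in-sample seasonal-naive MAE) gives LMSIS, UMSIS and MSIS.

```python
import numpy as np

def interval_score_parts(y, f, lower, upper, alpha=0.05):
    """Average lower- and upper-tail interval scores, as defined above.
    Unscaled: divide by the in-sample seasonal-naive MAE to obtain
    LMSIS, UMSIS and MSIS."""
    y, f, lower, upper = map(np.asarray, (y, f, lower, upper))
    u = np.mean((upper - f) + (2.0 / alpha) * np.maximum(y - upper, 0.0))
    l = np.mean((f - lower) + (2.0 / alpha) * np.maximum(lower - y, 0.0))
    return l, u, l + u

# Example: a 95% interval [95, 125] around F = 110 that misses y = 130.
print(interval_score_parts([130.0], [110.0], [95.0], [125.0]))  # (15.0, 215.0, 230.0)
```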
Case Study: The Box-Jenkins Airline Series
[Figure: time series plot of the monthly air passenger series, 144 observations, ranging from roughly 100 to 600.]
Models for the Airline Data
- The classic airline model is ARIMA(0,1,1)(0,1,1)_12.
- The series clearly shows increasing variance over time, which led initially to the use of a logarithmic transform. Chatfield and Prothero (1973) demonstrated that the log form may well be an over-correction for such series; a cube-root transform has been suggested. Could a study of PIs have identified the weakness of the log transform?
- The airline series consists of 144 observations, divided into an estimation sample of 96 observations and a test sample of 48 observations.
- As an alternative, we consider the ARIMA(1,0,0)(0,1,1)_12 + C model (a fitting sketch follows this list).
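A sketch of this comparison in statsmodels, under two assumptions of mine: the data are fetched from the Rdatasets archive (so network access is needed), and `SARIMAX` is an acceptable stand-in for whatever software produced the slides. RMSEs are computed on the transformed scale after discarding the first seasonal cycle of residuals.

```python
import numpy as np
from statsmodels.datasets import get_rdataset
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Box-Jenkins airline passengers (fetched from the Rdatasets archive).
y = get_rdataset("AirPassengers").data["value"].astype(float).to_numpy()
train = y[:96]                      # first eight years for estimation

transforms = {"linear": lambda x: x, "log": np.log, "cube root": np.cbrt}

for name, fwd in transforms.items():
    z = fwd(train)
    # Classic airline model: ARIMA(0,1,1)(0,1,1)_12, no constant.
    airline = SARIMAX(z, order=(0, 1, 1),
                      seasonal_order=(0, 1, 1, 12)).fit(disp=False)
    # Alternative: ARIMA(1,0,0)(0,1,1)_12 with a constant.
    alternate = SARIMAX(z, order=(1, 0, 0), trend="c",
                        seasonal_order=(0, 1, 1, 12)).fit(disp=False)
    rmse = lambda m: np.sqrt(np.mean(m.resid[13:] ** 2))
    print(f"{name:>9}: airline RMSE {rmse(airline):.4f}, "
          f"alternate RMSE {rmse(alternate):.4f}")
```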
Results of fitting the two models to the first eight years of data (n = 96). Parameter estimates and RMSE are on the transformed scale:

Estimation sample    Linear               Logarithmic          Cube root
                     Airline  Alternate   Airline  Alternate   Airline  Alternate
Constant                -       5.067        -       0.431        -       0.075
AR(1)                   -       0.818        -       0.678        -       0.712
MA(1)                 0.369       -        0.299       -        0.796       -
MA(12)                0.669     0.285      0.593     0.862      0.0738    0.217
RMSE                  9.641     9.347      0.0384    0.0360     0.0786    0.230

The alternate model has slightly lower RMSE but uses an extra parameter: not much to choose between the two using AICc.
Results for the holdout sample, last four years (n = 48). MFE = Mean Forecast Error, MAFE = Mean Absolute Forecast Error.

Holdout sample       Linear               Logarithmic          Cube root
                     Airline  Alternate   Airline  Alternate   Airline  Alternate
MFE                    8.38     19.75     -36.53    -45.64     -12.18    -8.50
MAFE                  20.04     24.59      38.55     47.43      21.41    23.32

- The log transform is clearly inferior.
- Linear undershoots; cube root overshoots by about the same amount.
- Airline is marginally better than Alternate.
Prediction Interval Performance Measures

Scaled interval scores and failure rates for the holdout sample (a sketch of the failure-rate computation follows):

Holdout sample       Linear               Logarithmic          Cube root
                     Airline  Alternate   Airline  Alternate   Airline  Alternate
LMSIS                  1.17     3.31        3.30    15.99        3.05     2.65
UMSIS                  5.23     3.31        4.75     1.30        3.81     1.20
MSIS                   6.40     6.62        8.05    27.29        6.86     3.85
LFR (2.5%)             0        0           0       60.42        0       18.75
UFR (2.5%)             0       16.67        0        0           0        4.17
FR (5.0%)              0       16.67        0       60.42        0       22.92

- For MSIS, cube root + Alternate is best.
- For failure rates, we expect 2.5% of misses in each tail.
- The Airline model never misses: its intervals are too wide.
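The failure rates above are simple empirical tail frequencies. A minimal sketch (the function name is mine):

```python
import numpy as np

def failure_rates(y, lower, upper):
    """Percentage of holdout points below the lower or above the upper
    PI limit; compare with the nominal 2.5% per tail for a 95% PI."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    lfr = 100.0 * np.mean(y < lower)
    ufr = 100.0 * np.mean(y > upper)
    return lfr, ufr, lfr + ufr
```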
Which Criterion Should We Choose?
The reason for forecasting is to aid planning. Thus, an element of judgment should be incorporated into the model selection process. Think about planning capacity for a new airport: only the upper tail matters!
Practical Advice [from the M4 Competition]
Fry and Brundage (2020, p.157) observe: "In fact, for some capacity problems we find that it is often more important to get the upper quantiles correct than to actually get the point forecasts correct."
Michael Gilliland (2020, p.163) comments: Yaniv and Foster (1995) found that people may prefer a narrower interval that does not include the true value to a wider interval that does. Du et al. (2011) found that people will tolerate intervals up to a certain width, accepting that some degree of uncertainty is unavoidable, but that wider intervals lose credibility.
Performance of Interval Forecasts, 1 to 48 Steps Ahead
[Figure: interval forecast performance for ARIMA(0,1,1)(0,1,1)_12 and ARIMA(1,0,0)(0,1,1)_12 + C.]
Conclusions and Directions for Future Research
Conclusions:
1. Choosing a forecasting method based only on point forecasts may give poor results for interval forecasts: context matters.
2. When specifying a PI, we need to decide on the width versus coverage trade-off.
3. Upper and lower tail results may be quite different: which one matters for the problem at hand?
4. Competition results are data-dependent and criteria-dependent, and depend on the strength of the opposition.
5. Choice of methods? Think horses for courses! (see Li et al., 2020; Ord et al., 2017)
Research questions:
1. Should we be considering prediction distributions and loss functions, such as the trade-off between holding costs and stockout costs for retail sales planning?
2. How might we incorporate a trade-off between point and interval forecast performance into automatic forecasting packages, e.g., something like MAFE + λ × MSIS as the objective function? (A sketch of such an objective follows this list.)
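One hedged reading of that objective as code: the function below combines MAFE with the scaled MSIS of a 95% interval. The name `combined_objective`, the default `lam`, and the `scale` argument are illustrative assumptions, not a published criterion.

```python
import numpy as np

def combined_objective(y, f, lower, upper, lam=0.1, alpha=0.05, scale=1.0):
    """Hypothetical selection criterion: MAFE + lam * MSIS.
    `scale` is the usual MSIS scaling factor (in-sample seasonal-naive MAE)
    and `lam` (the analyst's choice) sets the point-vs-interval trade-off."""
    y, f, lower, upper = map(np.asarray, (y, f, lower, upper))
    mafe = np.mean(np.abs(y - f))
    msis = np.mean((upper - lower)
                   + (2.0 / alpha) * np.maximum(lower - y, 0.0)
                   + (2.0 / alpha) * np.maximum(y - upper, 0.0)) / scale
    return mafe + lam * msis
```

An automatic package could evaluate this on a rolling holdout for each candidate model and pick the minimizer, with lam exposed to the user.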
Last Word?
"We argue that this distinction [between ML and SM] does not stem from fundamental differences in the methods assigned to either class. Instead, this distinction is probably of a tribal nature, [my italics] which limits the insights into the appropriateness and effectiveness of different forecasting methods." (Januschowski et al., 2020)
It is worth noting that the most successful ML procedure was LightGBM, a form of nonlinear regression.
SO, WHEN?? THANK YOU!
References
Chatfield, C. (1993). Calculating interval forecasts. Journal of Business and Economic Statistics, 11, 121-135.
Du, N., Budescu, D., Shelly, M. & Omer, T. (2011). The appeal of vague financial forecasts. Organizational Behavior and Human Decision Processes, 114, 179-189.
Efron, B. (2020). Prediction, estimation and attribution. Journal of the American Statistical Association, 115, 636-655.
Fry, C. & Brundage, M. (2020). The M4 forecasting competition: A practitioner's view. International Journal of Forecasting, 36, 157-160.
Gilliland, M. (2020). The value added by machine learning approaches in forecasting. International Journal of Forecasting, 36, 161-166.
Januschowski, T., Gasthaus, J., Wang, Y., Salinas, D., Flunkert, V., Bohlke-Schneider, M. & Callot, L. (2020). Criteria for classifying forecasting methods. International Journal of Forecasting, 36, 167-177.
Li, Y., Berry, D. & Lee, J. (2020). How to choose among three forecasting methods. Foresight, Issue 58, 7-14.
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36, 54-74.
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2022a). The M5 Accuracy Competition: Results, findings and conclusions. International Journal of Forecasting, to appear.
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2022b). The M5 Uncertainty Competition: Results, findings and conclusions. International Journal of Forecasting, to appear.
Ord, K., Fildes, R. & Kourentzes, N. (2017). Principles of Business Forecasting, 2nd edition. New York: Wessex Press.
Yaniv, I. & Foster, D. (1995). Graininess of judgment under uncertainty: An accuracy-informativeness tradeoff. Journal of Experimental Psychology: General, 124(4), 424-432.
ML and SM Methods
The Kaggle platform brought a welcome infusion of computer scientists into the competition, and the quality of ML methods used in M5 represents a considerable improvement over M4. But are the conclusions presented in MSA22b reasonable?
ML methods are often based upon statistical methodologies. For example, MSA22b refers to LightGBM as an ML algorithm for performing non-linear regression using gradient-boosted trees, and neural net methods build upon logistic regression and multilevel structures. What is new is the computing power now available to develop large-scale models involving multiple series, and to search for nonlinear structure.
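As an illustration of that point (a sketch, not the M5 winner's actual pipeline): LightGBM's built-in quantile objective is a pinball loss, so producing PI limits amounts to gradient-boosted quantile regression. The data and parameters below are synthetic placeholders.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(1)

# Toy stand-in for sales data: three synthetic features, non-negative target.
X = rng.normal(size=(5000, 3))
y = np.maximum(0.0, 2 + X @ np.array([1.5, -1.0, 0.5]) + rng.normal(size=5000))

# One boosted-tree model per quantile gives lower/median/upper forecasts;
# the "quantile" objective is exactly a pinball loss.
models = {q: lgb.LGBMRegressor(objective="quantile", alpha=q,
                               n_estimators=200).fit(X, y)
          for q in (0.025, 0.5, 0.975)}
lower, median, upper = (models[q].predict(X[:5]) for q in (0.025, 0.5, 0.975))
print(np.c_[lower, median, upper])
```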
Choice of Criteria
The Scaled Pinball Loss (SPL) function is well balanced in scoring under- and over-estimates for continuous distributions, and it worked well for aggregation levels 1-9. However, it is probable that many of the series at levels 10-12 are lumpy, with discrete, positively skewed distributions. (A sketch of the underlying pinball loss follows the table below.)
A second concern is the aggregation across all twelve levels at which data were recorded. This step is needed to produce an overall WSPL score, but the results are not uniform across the different levels. Rank correlations for the entries in MSA22b Table 2 [WSPL scores for the top 50 methods] across the three subgroups:

                 Levels 1-5   Levels 6-9   Levels 10-12
Levels 1-5         1
Levels 6-9         0.60         1
Levels 10-12       0.31         0.29         1
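A minimal sketch of the pinball loss that underlies SPL (the function name and comments are mine; M5 additionally scales the loss and weights it across series to form WSPL):

```python
import numpy as np

def pinball_loss(y, q_forecast, q):
    """Pinball (quantile) loss at level q; averaging over the scored
    quantiles and dividing by a scaling term gives the SPL used in the
    M5 Uncertainty Competition."""
    y, q_forecast = np.asarray(y), np.asarray(q_forecast)
    d = y - q_forecast
    return float(np.mean(np.maximum(q * d, (q - 1.0) * d)))

# Under-forecasting is penalised far more heavily at a high quantile:
print(pinball_loss([10.0], [8.0], 0.975))   # 1.95
print(pinball_loss([10.0], [12.0], 0.975))  # 0.05
```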
What Can Statistical Modelers Learn Going Forward?
1. The superiority of relatively simple methods is reaffirmed, although the statement should perhaps be more nuanced: don't use methods that are beyond what the data can bear. The present study uses a large information set, about 2,000 x 30,000 = 60 million data points, which will not be available in many applications.
2. How long a series is needed to sustain a particular method remains an important practical question.
3. The importance of explanatory variables depends both on the context and on the forecaster's objectives. If the focus is on policy evaluation, key variables must factor into the analysis. Likewise, calendar and pricing variables may be important for policy purposes. Forecasting software needs to improve the ability to incorporate such effects.
4. The question of compatible interval forecasts for different aggregation levels is intriguing and deserves further exploration. However, a note of caution is needed, as it is not clear what compatibility even means in this context. Indeed, it is perhaps the wrong question to ask, and the focus should be on compatible predictive distributions.
5. Any attempt to construct prediction intervals should keep in mind the timeless advice of Chatfield (1993), who pointed to several possible reasons for the failure of prediction intervals to perform adequately, including:
   a. uncertainty in model choice, or the model may change over time,
   b. uncertainty in the pattern of future random errors,
   c. uncertainty in parameter estimation.
6. Where the data structure allows, future competitions should consider a more complete range of statistical approaches, such as restricted vector models that incorporate explanatory variables. Combinations of methods should also be considered, given their past successes.