The Akaike Information Criterion (AIC) lets you test how well your model fits the data set without over-fitting it.
The AIC score rewards models that achieve a high goodness-of-fit score and penalizes them if they become overly complex.
By itself, the AIC score is not of much use unless it is compared with the AIC score of a competing model.
The model with the lower AIC score is expected to strike a superior balance between its ability to fit the data set and its ability to avoid over-fitting the data set.
The formula for the AIC score is as follows:

AIC = 2k - 2*ln(L)

where k is the number of parameters in the model and ln(L) is the maximized log-likelihood, i.e., the log of the likelihood of the observed data under the fitted model.
The AIC formula is built upon a few concepts that build on one another: the likelihood of the observed data under a model, the maximized likelihood obtained by fitting that model, and the logarithm of that maximized likelihood.
Let's take another look at the AIC formula, but this time, let's re-organize it a bit. Since 2k - 2*ln(L) = 2*ln(e^k / L), we can write:

AIC = 2*ln(e^k / L)

Let's recollect that a smaller AIC score is preferable to a larger score. Using the rewritten formula, one can see that the AIC score of the model will increase as the numerator e^k grows, and the numerator grows with the number of parameters in the model, i.e., with model complexity. Conversely, the AIC score will decrease as the denominator grows, and the denominator is the maximized likelihood of the model, which (as we just saw) measures the goodness-of-fit of the model.
The AIC score is useful only when it is used to compare two models. Let's say we have two such models with k_1 and k_2 parameters, and AIC scores AIC_1 and AIC_2.
Assume that AIC_1 < AIC_2, i.e., model 1 is the better of the two.
How much worse is model 2 than model 1? This question can be answered by computing the relative likelihood of model 2 with respect to model 1:

relative likelihood = exp((AIC_1 - AIC_2) / 2)
Why use the exp() function to compute the relative likelihood? Why not just subtract AIC_2 from AIC_1? For one thing, the exp() function ensures that the relative likelihood is always a positive number, which makes it easier to interpret. It also maps the AIC difference back onto the likelihood scale, so the result can be read as "model 2 is this many times as likely as model 1 to be the better model."
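As a quick illustration (a minimal sketch; the two AIC values here are made up for the example), the relative likelihood can be computed directly:

```python
import math

aic_1, aic_2 = 100.0, 102.0   # hypothetical AIC scores, with model 1 the better one

# Relative likelihood of model 2 with respect to model 1.
relative_likelihood = math.exp((aic_1 - aic_2) / 2)
print(relative_likelihood)    # ~0.368: model 2 is about 0.37 times as likely as model 1
```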
If you build and train an Ordinary Least Squares Regression model using the Python statsmodels library, statsmodels will compute the AIC score of the fitted model for you and report it as part of the model's training summary.
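For example, here is a minimal sketch (using synthetic data, purely for illustration) of where statsmodels surfaces the AIC of a fitted OLS model:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data, purely for illustration.
rng = np.random.default_rng(seed=42)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)

results = sm.OLS(y, X).fit()
print(results.aic)        # the AIC score of the fitted model
print(results.summary())  # the AIC also appears in the model's training summary
```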
Let’s perform what might hopefully turn out to be an interesting model selection experiment. We’ll use a data set of daily average temperatures in the city of Boston, MA from 1978 to 2019. This data can be downloaded from NOAA’s website.
The raw data set (which you can access here) contains the daily average temperature values. The first few rows of the raw data are reproduced below:
For our model selection experiment, we’ll aggregate the data at a month level.
After aggregation, which we’ll soon see how to do in pandas, the plotted values for each month look as follows:
Let’s also plot the average temperature TAVG against a time lagged version of itself for various time lags going from 1 month to 12 months. Following is the set of resulting scatter plots:
There is clearly a strong correlation at lags 6 and 12, which is to be expected for monthly averaged temperature data. Other lags, such as LAG1, LAG5 and LAG7, may also exhibit a significant ability to explain some of the variance in the target variable's value. We'll find out soon enough if that's true.
Our regression goal will be to create a model that will predict the monthly average temperature in Boston, namely the TAVG value. Therefore our target, a.k.a. the response variable, will be TAVG.
Our regression strategy will be as follows:
1. Create 12 time-lagged copies of TAVG (TAVG_LAG_1 through TAVG_LAG_12).
2. Build an OLS regression model for every possible combination of these lagged variables, with TAVG as the response.
3. Train each model on the training portion of the data and note its AIC score.
4. Select the model with the lowest AIC score as the optimal model.
5. Evaluate the optimal model on the held-out test data.
Let’s implement this strategy.
Import all the required packages.
Read the data set into a pandas data frame.
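A sketch of these first two steps might look like this (the file name and the DATE column name are assumptions based on the description of the NOAA data set):

```python
import itertools

import pandas as pd
import statsmodels.formula.api as smf

# Read the daily data; the file and column names are assumed for illustration.
df = pd.read_csv('boston_daily_temps_1978_2019.csv', header=0,
                 parse_dates=['DATE'], index_col='DATE')
print(df.head())
```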
The data set contains daily average temperatures. We want monthly averages. So let’s roll up the data to a month level. This turns out to be a simple thing to do using pandas.
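Assuming the data frame is indexed by date, one way to roll the daily values up to monthly averages is with resample():

```python
# Average the daily TAVG values within each calendar month.
df_monthly = df[['TAVG']].resample('M').mean()
print(df_monthly.head())
```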
We are about to add lagged variable columns into the data set. Let’s create a copy of the data set so that we don’t disturb the original data set.
Add 12 columns, each one containing a time-lagged version of TAVG.
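A sketch of the copy-and-lag step (the column names TAVG_LAG_1 through TAVG_LAG_12 are the ones used in the rest of the article):

```python
# Work on a copy so the original monthly data set is left untouched.
df_lagged = df_monthly.copy()

# Add 12 time-lagged copies of TAVG.
for lag in range(1, 13):
    df_lagged[f'TAVG_LAG_{lag}'] = df_lagged['TAVG'].shift(lag)
```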
Print out the first 15 rows of the lagged variables data set.
This prints out the following output:
The first 12 rows contain NaNs introduced by the shift function. Let’s remove these 12 rows.
Print out the first few rows just to confirm that the NaNs have been removed.
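These inspection and clean-up steps might look like this:

```python
# The first 12 rows contain NaNs introduced by shift().
print(df_lagged.head(15))

# Drop the 12 incomplete rows and confirm the NaNs are gone.
df_lagged = df_lagged.dropna()
print(df_lagged.head())
```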
Before we do any more peeking and poking into the data, we will put aside 20% of the data set for testing the optimal model.
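One simple way to set aside 20% of the rows for testing is a chronological split that keeps the most recent months for the test set (a sketch, not necessarily the exact split used in the original experiment):

```python
# Use the first 80% of the months for training and the last 20% for testing.
split_index = int(len(df_lagged) * 0.8)
df_train = df_lagged.iloc[:split_index]
df_test = df_lagged.iloc[split_index:]
print(len(df_train), len(df_test))
```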
Now let’s create all possible combinations of lagged values. For this, we’ll create a dictionary in which the keys contain different combinations of the lag numbers 1 through 12.
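A sketch of building the dictionary of lag combinations with itertools:

```python
# Keys are tuples of lag numbers, values are the corresponding column names.
lag_combinations = {}
for r in range(1, 13):
    for combo in itertools.combinations(range(1, 13), r):
        lag_combinations[combo] = [f'TAVG_LAG_{lag}' for lag in combo]

print(len(lag_combinations))   # 2**12 - 1 = 4095 combinations
```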
Next, we will iterate over all the generated combinations. For each lag combination, we'll build the model's expression using the Patsy syntax, build the linear regression model for that combination of variables, train the model on the training data set, and ask statsmodels for the model's AIC score. We'll keep track of the lowest AIC score seen so far and the corresponding 'best model'. We'll do all of this in the following piece of code:
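A sketch of that search loop, using the formula (Patsy) interface of statsmodels:

```python
min_aic = float('inf')
best_model = None
best_expr = None

for combo, lag_columns in lag_combinations.items():
    # Build the Patsy expression, e.g. 'TAVG ~ TAVG_LAG_1 + TAVG_LAG_6'.
    expr = 'TAVG ~ ' + ' + '.join(lag_columns)

    # Build and train the OLS model for this combination of lagged variables.
    results = smf.ols(expr, data=df_train).fit()

    # Keep track of the lowest AIC score seen so far.
    if results.aic < min_aic:
        min_aic = results.aic
        best_model = results
        best_expr = expr

print('Best expression:', best_expr)
print('Best AIC score:', min_aic)
```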
Finally, let’s print out the summary of the best OLSR model as per our evaluation criterion. This is the model with the lowest AIC score.
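Continuing the sketch above, that is simply:

```python
# Summary of the lowest-AIC model found by the search.
print(best_model.summary())
```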
This prints out the model training summary. A few areas of the output are worth a closer look, so let's inspect them one by one.
Our AIC score based model evaluation strategy has identified a model with the following parameters: TAVG_LAG_1, TAVG_LAG_2, TAVG_LAG_5, TAVG_LAG_6, TAVG_LAG_10, TAVG_LAG_11, TAVG_LAG_12, and the regression intercept.
The other lags, 3, 4, 7, 8, 9 have been determined to not be significant enough to jointly explain the variance of the dependent variable TAVG. For example, we see that TAVG_LAG_7 is not present in the optimal model even though from the scatter plots we saw earlier, there seemed to be a good amount of correlation between the response variable TAVG and TAVG_LAG_7. The reason for the omission might be that most of the information in TAVG_LAG_7 may have been captured by TAVG_LAG_6, and we can see that TAVG_LAG_6 is included in the optimal model.
The second thing to note is that all parameters of the optimal model, except for TAVG_LAG_10, are individually statistically significant at the 95% confidence level on the two-tailed t-test: the reported (two-sided) p-value for their t score is smaller than 0.05, the significance threshold for a two-tailed test at the 95% confidence level.
The third thing to note is that all parameters of the model are jointly significant in explaining the variance in the response variable TAVG.
This can be seen from the F-statistic of 1458 and its p-value of 1.15e-272, far below the 0.05 threshold at the 95% confidence level. This probability value is so incredibly tiny that you don't even need to look up the F-distribution table to verify that the F-statistic is significant. The model is definitely much better at explaining the variance in TAVG than an intercept-only model.
Finally, let’s take a look at the AIC score of 1990.0 reported by statsmodels, and the maximized log-likelihood of -986.86.
We can see that the model contains 8 parameters (7 time-lagged variables + intercept). So as per the formula for the AIC score:
AIC score = 2*(number of parameters) - 2*(maximized log-likelihood) = 2*8 - 2*(-986.86) = 16 + 1973.72 = 1989.72, which rounds to 1990.0.
This is exactly the value reported by statsmodels.
The final step in our experiment is to test the optimal model’s performance on the test data set. Remember that the model has not seen this data during training.
We will ask the model to generate predictions on the test data set using the following single line of code:
Let’s get the summary frame of predictions and print out the first few rows.
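Continuing the sketch, the prediction and summary-frame steps might look like this:

```python
# Generate predictions for the held-out test months.
predictions = best_model.get_prediction(df_test)

# The summary frame contains the mean prediction plus confidence intervals.
pred_summary = predictions.summary_frame()
print(pred_summary.head())
```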
The output looks like this:
Next, let’s pull out the actual and the forecasted TAVG values so that we can plot them:
Finally, let’s plot the predicted TAVG versus the actual TAVG from the test data set.
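A sketch of pulling out the two series and plotting them:

```python
import matplotlib.pyplot as plt

actual = df_test['TAVG']
predicted = pred_summary['mean'].values   # align by position with the test index

plt.figure(figsize=(10, 5))
plt.plot(actual.index, actual.values, label='Actual TAVG')
plt.plot(actual.index, predicted, label='Predicted TAVG')
plt.xlabel('Month')
plt.ylabel('Average temperature')
plt.legend()
plt.show()
```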
The plot looks like this:
In the above plot, it might seem like our model is amazingly capable of forecasting temperatures several years out into the future! However, the reality is quite different. What we are asking the model to do is predict the current month's average temperature by considering the temperatures of the previous month, the month before that, and so on; in other words, by considering the values of the model's regression variables TAVG_LAG_1, TAVG_LAG_2, TAVG_LAG_5, TAVG_LAG_6, TAVG_LAG_10, TAVG_LAG_11 and TAVG_LAG_12, plus the regression intercept.
We are asking the model to make this forecast for each time period, and we are asking it to do so for as many time periods as the number of samples in the test data set. Thus our model can reliably make only one month ahead forecasts. This behavior is entirely expected given that one of the parameters in the model is the previous month’s average temperature value TAVG_LAG1.
This completes our model selection experiment.
Here is the complete Python code used in this article:
The data set is available here.
Let's summarize the important points:
- The AIC score rewards goodness-of-fit (via the maximized log-likelihood) and penalizes model complexity (via the number of parameters).
- An AIC score is meaningful only relative to the AIC score of a competing model fitted to the same data; the model with the lower score is preferred.
- In our experiment, an exhaustive search over lag combinations, scored by AIC, selected a seven-lag OLS model of the monthly average temperature.
- Because the selected model uses the previous month's temperature as a regressor, it can reliably make only one-month-ahead forecasts.
Monthly average temperature in the city of Boston, Massachusetts (Source: NOAA)
Akaike H. (1998) Information Theory and an Extension of the Maximum Likelihood Principle. In: Parzen E., Tanabe K., Kitagawa G. (eds) Selected Papers of Hirotugu Akaike. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY. https://doi.org/10.1007/978-1-4612-1694-0_15
All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
The following sections cover when to use AIC, the assumptions it makes, the small-sample correction AICc, and how to interpret a set of AIC scores.
AIC is most frequently used in situations where it is not easy to evaluate a model on a held-out test set, as in standard machine learning practice (for example, with small data sets or time series). AIC is particularly valuable for time series, because the most informative observations are usually the most recent ones, which would otherwise be locked away in the validation and test sets. As a result, training on all the data and using AIC can result in better model selection than a traditional train/validation/test split.
AIC works by evaluating the model's fit on the training data and adding a penalty term for the complexity of the model (similar in spirit to regularization). The desired result is to find the lowest possible AIC, which indicates the best balance of model fit and generalizability. This serves the eventual goal of maximizing fit on out-of-sample data.
AIC uses a model's maximized log-likelihood as its measure of fit. The likelihood measures how probable the observed data are under a given model, so the model with the maximum likelihood is the one that "fits" the data best. The natural log of the likelihood is used as a computational convenience.
AIC is low for models with high log-likelihoods, meaning the model fits the data better, which is what we want. But it adds a penalty term for models with more parameters, since a more complex model is more likely to overfit the training data.
AIC is typically used when you don't have access to out-of-sample data and want to decide between multiple different model types, or simply to save time. My most recent motivation to use AIC was when I was quickly evaluating multiple seasonal autoregressive integrated moving average (SARIMA) models to find the best baseline model, and I wanted to do so while retaining all the data in my training set.
When evaluating SARIMA models, it's important to note that AIC assumes all models are trained on the same data. So, using AIC to decide between different orders of differencing is technically invalid, since one data point is lost with each order of differencing. You must be able to fulfill AIC's assumptions. AIC assumes that you:
- train every candidate model on exactly the same data,
- model the same dependent variable in every case, and
- have a sufficiently large sample size.
That last assumption exists because AIC converges to the correct answer only as the sample size goes to infinity. A large sample is often a good enough approximation, but since using AIC frequently means you have a small sample size, there is a sample-size-adjusted formula called AICc. It adds a correction term that converges to the plain AIC answer for large samples but gives a more accurate answer for smaller ones.
As a rule of thumb, you should always use AICc to be safe, but AICc is especially important when the ratio of data points (n) to parameters (k) is less than 40.
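For reference, the standard small-sample correction is:

AICc = AIC + (2k^2 + 2k) / (n - k - 1)

where n is the sample size and k is the number of parameters. As n grows relative to k, the correction term shrinks toward zero and AICc converges to AIC.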
Once the assumptions of AIC (or AICc) have been met, the biggest advantage is that your models do not need to be nested for the comparison to be valid, unlike the likelihood-ratio test, which requires nesting. A nested model is a model whose parameters are a subset of the parameters of another model. As a result, vastly different models can be compared mathematically with AIC.
Once you have a set of AIC scores, what do you do with them? Pick the model with the lowest score as the best? You could do that, but AIC scores are a probabilistic ranking of the models that are likely to minimize the information loss (best fit the data). I’ll explain via the formula below.
Assume you have calculated the AICs for multiple models and you have a series of AIC scores (AIC_1, AIC_2, ... AIC_n). For any given AIC_i, you can calculate the probability that the ith model minimizes the information loss using the formula below, where AIC_min is the lowest AIC score in your series of scores:

exp((AIC_min - AIC_i) / 2)
There's a standard worked example of this: with two AIC scores of 100 and 102, the 102-score model is 0.368 times as probable as the 100-score model to be the best model, and a model with an AIC of 110 is only 0.007 times as probable. While this means that you can never know for certain from AIC alone that one model is better than another (it uses only in-sample data, after all), there are strategies to handle these probabilistic results:
If the precision of your answer is not of the utmost importance and you simply want to select the lowest AIC, know that you are more likely to have picked a suboptimal model when other AIC scores are close to the minimum value from your experiments. A score of 100 versus 100.1 may leave you indifferent between the two models, whereas 100 versus 120 would not.
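A minimal sketch of how these relative likelihoods can be computed for a set of AIC scores (the three scores are the ones from the example above):

```python
import numpy as np

aic_scores = np.array([100.0, 102.0, 110.0])
aic_min = aic_scores.min()

# Probability that each model minimizes the information loss,
# relative to the lowest-AIC model.
relative_likelihoods = np.exp((aic_min - aic_scores) / 2)
print(relative_likelihoods)   # approximately [1.0, 0.368, 0.007]
```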
The best-fit model according to AIC is the one that explains the greatest amount of variation using the fewest possible independent variables.
In statistics, AIC is most often used for model selection. By calculating and comparing the AIC scores of several possible models, you can choose the one that is the best fit for the data.
When testing a hypothesis, you might gather data on variables that you aren’t certain about, especially if you are exploring a new idea. You want to know which of the independent variables you have measured explain the variation in your dependent variable.
A good way to find out is to create a set of models, each containing a different combination of the independent variables you have measured. These combinations should reflect your knowledge of the study system and the specific hypotheses you want to test.
Once you’ve created several possible models, you can use AIC to compare them. Lower AIC scores are better, and AIC penalizes models that use more parameters. So if two models explain the same amount of variation, the one with fewer parameters will have a lower AIC score and will be the better-fit model.
AIC determines the relative information value of the model using the maximum likelihood estimate and the number of parameters (independent variables) in the model. The formula for AIC is:

AIC = 2K - 2*ln(L)

K is the number of parameters in the model and ln(L) is the log of the model's maximized likelihood (i.e., how likely it is that the model could have produced your observed y-values). The baseline K is always 2, so if your model uses one independent variable your K will be 3, if it uses two independent variables your K will be 4, and so on.
To compare models using AIC, you need to calculate the AIC of each model. If a model is more than 2 AIC units lower than another, then it is considered significantly better than that model.
You can easily calculate AIC by hand if you have the log-likelihood of your model, but calculating log-likelihood is complicated! Most statistical software will include a function for calculating AIC. We will use R to run our AIC analysis.
To compare several models, you can first create the full set of models you want to compare and then run aictab() on the set.
For the sugar-sweetened beverage data, we’ll create a set of models that include the three predictor variables (age, sex, and beverage consumption) in various combinations. Download the dataset and run the lines of code in R to try it yourself.
First, we can test how each variable performs separately.
Next, we want to know if the combination of age and sex are better at describing variation in BMI on their own, without including beverage consumption.
We also want to know whether the combination of age, sex, and beverage consumption is better at describing the variation in BMI than any of the previous models.
Finally, we can check whether the interaction of age, sex, and beverage consumption can explain BMI better than any of the previous models.
To compare these models and find which one is the best fit for the data, you can put them together into a list and use the aictab() command to compare all of them at once. To use aictab(), first load the library AICcmodavg.
Then put the models into a list (‘models’) and name (label) each of them so the AIC table is easier to read (‘model.names’).
Finally, run aictab() to do the comparison.
Running this code produces a model selection table. The best-fit model is always listed first. The model selection table includes information on:
- K: the number of parameters in the model
- AICc: the small-sample-corrected AIC score for the model
- Delta_AICc: the difference between this model's AICc and that of the best model
- AICcWt: the AICc weight, i.e., the proportion of the total predictive power accounted for by this model
- Cum.Wt: the cumulative sum of the model weights
- LL: the log-likelihood of the model
From this table we can see that the best model is the combination model – the model that includes every parameter but no interactions (bmi ~ age + sex + consumption).
The model is much better than all the others, as it carries 96% of the cumulative model weight and has the lowest AIC score. The next-best model is more than 2 AIC units higher than the best model (6.33 units) and carries only 4% of the cumulative model weight.
Based on this comparison, we would choose the combination model to use in our data analysis.
If you are using AIC model selection in your research, you can state this in your methods section of your thesis, dissertation, or research paper. Report that you used AIC model selection, briefly explain the best-fit model you found, and state the AIC weight of the model.
After finding the best-fit model you can go ahead and run the model and evaluate the results. The output of your model evaluation can be reported in the results section of your paper.