A Guide to Evaluating Models

Posted on November 18, 2007 by


One of my tasks this past weekend, beyond watching football and basketball, was to review a paper for the Journal of Sports Economics. Although I can’t reveal the details of the paper, I can note that it was fairly standard for the sports economics literature. A model was proposed, estimated, and discussed. My task as a reviewer was to evaluate the “quality” of this work.

So how do we do this? How do we evaluate a model? My sense in reading comments online is that this is not a well-understood process. So I thought I would take the time to write down some basic guidelines that other researchers and I consider when we review a model.

It’s important to note (as I will emphasize again before I conclude this column) that beyond the first issue, there is no specific order of importance to these guidelines. Rather, these are just a collection of issues we keep in mind when we are reviewing a paper. So although I have numbered these points, don’t let the numbering suggest a ranking.

I would also add that although this column goes on for more than 2,000 words, this exercise merely serves as an introduction to the topic of model evaluation.  If you want more information you probably need to take a few classes in statistics and econometrics.

Okay, enough of the caveats, here are the guidelines:

1.  The theoretical foundation of the model

This is THE most important issue. A regression model looks at the link between a dependent variable (what you are trying to explain) and one or more independent variables (what you think does the explaining). The choice of independent variables must be guided by theory.

Why is theory so important? It’s important to remember that statistical analysis can tell us about correlations. Causation is inferred from theory.  So if you have no theory, it’s not clear what your model is telling us.  And without a theory, it’s not clear what other researchers and/or decision-makers would ultimately do with your results.  In essence (with very few exceptions), if you ain’t got a theory, you ain’t got a model.

So having a sound theory is important. But you have to do more if you want to see your work published. Once we are satisfied that your work has a clear theoretical foundation, we then consider a series of other issues.

2.  Statistical significance and economic significance

After we understand the theory being tested, and assuming the model was estimated correctly (correct functional form, no econometric issues like heteroskedasticity, autocorrelation, multicollinearity, etc.), we tend to look next at the statistical significance of the estimated coefficients. By statistical significance I mean whether or not the estimated coefficients (for the independent variables) are statistically distinguishable from zero. If they are, then the researcher might have found something.
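To make the idea of a coefficient being "different from zero" concrete, here is a minimal sketch of the usual t-statistic for the slope in a one-variable regression. The function name and the data are made up for illustration, and real work would use a statistics package and more than one independent variable.

```python
import math


def slope_t_statistic(x, y):
    """Fit y = a + b*x by least squares; return (slope, standard error, t-statistic).

    Assumes x varies and the fit is not perfect (a perfect fit gives a zero
    standard error, which this sketch does not handle).
    """
    n = len(x)
    if n < 3:
        raise ValueError("need at least three observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals, used to estimate the error variance.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(sse / (n - 2) / sxx)
    return slope, std_error, slope / std_error


# Hypothetical data, not taken from any study.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, std_error, t_stat = slope_t_statistic(x, y)
print(slope, std_error, t_stat)
```

As a rough rule of thumb in large samples, a t-statistic larger than about 2 in absolute value corresponds to significance at the 5% level. With only a handful of observations the threshold is higher.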

As we note in The Wages of Wins, Deirdre McCloskey has spent quite a bit of time hammering home the point that statistical significance is not the final word in evaluating a coefficient.  We also want to know the economic significance of the results.  What this means is that we want to know how much each coefficient matters.

To illustrate, we find that salary is linked in basketball to scoring, rebounds, blocked shots, and assists.  These factors have a statistically significant impact on how much a player is paid. But what is important is which factor matters most.  And when we look at economic significance, we find that scoring is the most important determinant of player compensation.   
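One rough way to compare how much each coefficient matters is to ask how far the dependent variable moves when an independent variable changes by one standard deviation. The sketch below does only that. The helper name, coefficients, and samples are all hypothetical, and this is not necessarily how economic significance was measured in the research discussed here.

```python
import statistics


def effect_of_one_sd(coefficient, values):
    """Change in the dependent variable for a one-standard-deviation
    change in a single independent variable (sample standard deviation)."""
    return coefficient * statistics.stdev(values)


# Hypothetical coefficients (salary dollars per unit of each statistic) and
# hypothetical per-game samples; none of these numbers come from real research.
coefficients = {"points": 300_000, "rebounds": 200_000}
samples = {
    "points": [4.0, 8.5, 12.0, 17.5, 24.0],
    "rebounds": [2.0, 3.5, 4.0, 6.5, 9.0],
}

for name, coefficient in coefficients.items():
    print(name, round(effect_of_one_sd(coefficient, samples[name])))
```

Two coefficients can both be statistically significant while one of these scaled effects is several times the size of the other, which is the distinction the economic-significance question is after.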

Historically, as McCloskey points out, many economists stopped with a discussion of statistical significance. But I think more recently, probably due to the persistence of McCloskey, economists are increasingly taking the time to talk about economic significance.

3.  Robustness of results

If we estimated the model differently would we get substantially different results? When we ask this question we are asking if the results are “robust.”

To illustrate this point, consider the Price-Wolfers paper. This study found evidence of racial bias in how NBA referees called personal fouls.  What was impressive about this result was that the finding of racial bias remained even after the authors tried a multitude of specifications.  When we see such “robustness” our confidence in the results tends to rise.

4. Explanatory power

Non-economists and students tend to get most excited about explanatory power or R2.  R2, for those who are not statistically inclined, is a simple ratio comparing the amount of variation in the dependent variable your model explained to the amount of variation that exists in the dependent variable.  For example, if you have an R2 of 0.55, then your model explains 55% of the variation in the dependent variable.  This means that 45% of the variation in the dependent variable was not explained.
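For readers who prefer to see the arithmetic, here is a minimal sketch of the calculation. It computes R2 as one minus the ratio of unexplained variation to total variation, which matches the "explained over total" description for a least-squares model with an intercept. The function name and numbers are illustrative only.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    # Total variation in the dependent variable around its mean.
    total = sum((a - mean_actual) ** 2 for a in actual)
    if total == 0:
        raise ValueError("the dependent variable does not vary")
    # Variation the model failed to explain (squared residuals).
    unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - unexplained / total


# Made-up values: total variation is 5 and unexplained variation is 1.
print(r_squared([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5]))
```

In this made-up case the model explains about 80% of the variation, leaving about 20% unexplained.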

If you are comparing two models that are seeking to explain the same thing, you might consider explanatory power in deciding which model you prefer. However, and this is a point emphasized in most econometrics textbooks, just because one model can explain more does not mean it’s the preferred model. A model with higher explanatory power, but with no theory behind it or with serious econometric problems, will be rejected by researchers and reviewers.

I would also note that I have had models published with an R2 that was less than 10%. In fact, the aforementioned Price-Wolfers paper presented several models, ranging in explanatory power from 1% to 28%.  Again, explanatory power tells us something, but it doesn’t tell us everything.

5. Forecasting power out of sample

Related to explanatory power is the issue of forecasting.  Explanatory power evaluates a model within the sample examined.  Forecasting considers how well a model does out of sample.

To be honest, few papers make any effort to forecast out of sample.  Still, this is an issue one could consider in evaluating a model.  Basically, if a model has high explanatory power, but cannot forecast, then we suspect the results are entirely specific to the sample tested.   This tells us that results cannot be generalized and are perhaps of little interest.   Furthermore, for decision-makers, an inability to forecast with a model is a significant problem.  After all, decisions are about the future. If your model only applies to the past, then it probably cannot help a person make better choices about the future.
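The in-sample versus out-of-sample distinction can be sketched in a few lines: estimate the model on one set of observations, then measure its errors on observations it never saw. Everything below (function names, the one-variable model, the data) is an illustrative assumption rather than a procedure from any particular paper.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


def mean_squared_error(x, y, intercept, slope):
    """Average squared gap between the observed y and the fitted line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical data: estimate on the earlier observations,
# then forecast the later ones that were held out.
train_x, train_y = [1, 2, 3, 4, 5, 6], [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
test_x, test_y = [7, 8, 9], [6.8, 8.3, 8.9]

intercept, slope = fit_line(train_x, train_y)
in_sample = mean_squared_error(train_x, train_y, intercept, slope)
out_of_sample = mean_squared_error(test_x, test_y, intercept, slope)
print(in_sample, out_of_sample)
```

If the out-of-sample error is far larger than the in-sample error, that is a sign the results may be specific to the sample used for estimation.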

6. Simplicity of model

Which is a “better” model, NBA Efficiency, Game Score, or Player Efficiency Rating (PER)?  Before answering, let me quickly note that Game Score is John Hollinger’s simple version of PER.  The calculation of Game Score is as follows.

Game Score = Points + 0.4*Field goals made + 0.7*Offensive rebounds + 0.3*Defensive rebounds + Steals + 0.7*Assists + 0.7*Blocked shots – 0.7*Field goal attempts – 0.4*Free throws missed – 0.4*Personal fouls – Turnovers

Game Score does not make any adjustments like PER does for team pace.  It simply adds and subtracts the box score statistics, according to the various weights Hollinger has chosen. 
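The formula above translates directly into code. This is a minimal sketch using the weights exactly as listed; the function and parameter names are my own, and the stat line is made up.

```python
def game_score(points, field_goals_made, offensive_rebounds, defensive_rebounds,
               steals, assists, blocks, field_goal_attempts,
               free_throws_missed, personal_fouls, turnovers):
    """Game Score using the weights listed in the formula above."""
    return (points
            + 0.4 * field_goals_made
            + 0.7 * offensive_rebounds
            + 0.3 * defensive_rebounds
            + steals
            + 0.7 * assists
            + 0.7 * blocks
            - 0.7 * field_goal_attempts
            - 0.4 * free_throws_missed
            - 0.4 * personal_fouls
            - turnovers)


# A made-up single-game stat line, for illustration only.
print(game_score(points=20, field_goals_made=8, offensive_rebounds=2,
                 defensive_rebounds=5, steals=1, assists=4, blocks=1,
                 field_goal_attempts=15, free_throws_missed=1,
                 personal_fouls=3, turnovers=2))
```

For this hypothetical line the result works out to roughly 16.5.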

Is PER “better” than Game Score? One might think the more complicated model (PER) must be “better”. But when we look at player performance last season we find a 0.99 correlation between a player’s per-minute Game Score and his PER. In essence, each model tells the same story.

What about Game Score and NBA Efficiency?  Here is the calculation for NBA Efficiency.

NBA Efficiency = Points + Rebounds + Steals + Assists + Blocked Shots – Turnovers – All Missed Shots
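In code, the same calculation might look like the sketch below. I am assuming that "All Missed Shots" means missed field goals plus missed free throws; the function and parameter names are illustrative.

```python
def nba_efficiency(points, rebounds, steals, assists, blocks, turnovers,
                   missed_field_goals, missed_free_throws):
    """NBA Efficiency as written above.

    Assumption: "All Missed Shots" is missed field goals plus missed free throws.
    """
    missed_shots = missed_field_goals + missed_free_throws
    return points + rebounds + steals + assists + blocks - turnovers - missed_shots


# A made-up stat line, for illustration only.
print(nba_efficiency(points=20, rebounds=7, steals=1, assists=4, blocks=1,
                     turnovers=2, missed_field_goals=7, missed_free_throws=1))
```

Every statistic carries a weight of plus or minus one, which is what makes this measure so easy to compute.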

Game Score is a bit more complicated than NBA Efficiency.  But when we look at player performance in 2006-07, we again find a 0.99 correlation between these two measures. So these models are also telling us the same story.
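The correlations quoted here are ordinary Pearson correlation coefficients between two sets of player ratings. A minimal sketch of that calculation follows; the function name and ratings are hypothetical and are not the 2006-07 data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least two values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)


# Hypothetical ratings for five players under two different measures.
measure_a = [10.2, 14.8, 7.5, 21.0, 12.3]
measure_b = [11.0, 15.5, 8.1, 22.4, 12.9]
print(pearson_correlation(measure_a, measure_b))
```

A value near 1 means the two measures order and space the players almost identically, which is the sense in which the models "tell the same story."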

Given these similarities, which should we prefer? In general we prefer a simple model, or what we call a parsimonious model, to a complex model. In other words, we don’t add complexity to a model unless it helps us do something (e.g., solve an econometric problem or improve explanatory power). If two models give essentially the same answer, the simple model will be easier to both explain and work with. So that is our choice.

Comparing NBA Efficiency to Wins Produced, Win Score, and PAWS

My list of factors is not complete.  For example, I left out whether the work is actually “interesting” or “important.”  Still, I think this list gives a good set of issues to consider when looking at a model.  To illustrate how this checklist can be used, let’s think about NBA Efficiency.

On the plus side, this model is quite simple (issue #6) and it does forecast itself (although not team wins) fairly well (issue #5). But its theoretical foundation is weak (issue #1) and a team’s NBA Efficiency only explains 23% of team wins (issue #4). Now if you change the dependent variable from team wins to player salaries, the usefulness of NBA Efficiency increases. Although NBA Efficiency doesn’t do a good job of explaining wins, it does a nice job of explaining free agent salary. In essence, NBA Efficiency (and models like it) does a nice job of telling us about perceptions of performance. It just has some problems if our objective is to measure the impact a player has on wins.

Now consider Wins Produced, Win Score, and PAWS.  Wins Produced is derived from the relationship between wins and offensive and defensive efficiency.  This relationship is based on sound theory, so there’s a clear theoretical foundation.  And as end note 37 from Chapter Six of The Wages of Wins indicates, the results are also somewhat robust.  One can make a few minor alterations to the model reported in Berri (1999) and get the same results.

Beyond theory and robustness, we also see that Wins Produced does a nice job explaining current wins and allows us to forecast.  Again, those are positives.  Finally, one can derive two simple models from Wins Produced.   Both Win Score and PAWS are quite easy to calculate and essentially tell the same story as Wins Produced. 

So when we consider issues like theory, robustness, explanatory power, forecasting, and simplicity, the Wages of Wins basketball measures appear to be at least “reasonable” models.  In contrast, NBA Efficiency appears to have a few shortcomings. 

A Few Last Issues to Consider

It’s important to note that these are guidelines, not a score-sheet.  In other words, it’s not like each factor is assigned a value and we simply choose models that have the highest scores.  No, this process is not quite that precise. All we do is review the model with these issues in mind.  Then if we think the model falls short with respect to one or more issues we suggest the author make some changes (or if the problems are quite severe, reject the paper).

Given the lack of precision in the process, it’s not uncommon for two researchers looking at the same model to come to different conclusions. Certainly authors of a paper often disagree with the reviewers. The blind peer-review process does not give us the “truth” or a definitive answer. It is, though, better than any alternative we have come up with to review research (by the way, I should dedicate an entire post to why reviews need to be blind and generated by an author’s peers).

I would emphasize that although the process often leads to disagreement, in my experience few people take the disagreements personally.  Some of my best friends in economics produce work I am not too sure about (and they probably feel the same way about my stuff).  Often non-researchers characterize what we do as a “contest.”  In reality, research is really just a cool way to spend your time.

Let me close by emphasizing that there is one factor I did not include in the guidelines.  Often I see people stating that model A is “good” because it fits what everyone already believes.  Or model B is “bad” because it doesn’t.  I have serious problems with such an approach.  In general, in evaluating a model we do not consider whether the model confirms or rejects our pre-conceived notions.  So with respect to NBA Efficiency, Wins Produced, or any other model designed to measure player performance, we do not consider whether the model returns a ranking consistent with our prior beliefs. 

I often argue that if we let prior beliefs determine whether we accept or reject a model, then we might as well skip this entire process and simply take a vote. And although voting is nice, I am not convinced the democratic process produces results that trump solid statistical analysis.

– DJ

The following column offers even more on the importance of simplicity in building a model: Talking with Henry Abbott and a Comment about Model Building

Our research on the NBA was summarized HERE.

The Technical Notes at wagesofwins.com provide substantially more information on the published research behind Wins Produced and Win Score.

Wins Produced, Win Score, and PAWSmin are also discussed in the following posts:

Simple Models of Player Performance

Wins Produced vs. Win Score

What Wins Produced Says and What It Does Not Say

Introducing PAWSmin — and a Defense of Box Score Statistics
