Solutions to Project 1

Author

James W. Hesford

Background

In this project you will estimate the cost function for an economy lodging hotel in San Antonio, Texas. The case is set at the beginning of 2023 and you have been given: (1) monthly data on room rentals and cost; and, (2) daily data on temperature.

The Problem of Inflation

With data from many years, you must adjust expenses prior to 2022 to be equivalent to 2022 spending. You do this using the consumer price index (CPI), a measure obtained from the Bureau of Labor Statistics.

CPI for 2015 is 1.000 and, for 2022, CPI is 1.185. This means that, if someone spent $100 in 2015 for a bundle of goods, the same bundle would cost $118.50 in 2022. To convert costs prior to 2022, divide each monthly cost for a specific year by that year’s CPI, then multiply the result by the 2022 CPI. Monthly cost figures for 2022 are not adjusted, while 2015-2021 monthly costs are adjusted upward (since the 2022 CPI is greater than all prior years’ CPI). The formula for adjusted cost is CostAdj = Cost[i] * (CPI[2022]/CPI[i]), where i is the year. Table 1 below provides descriptive statistics for the cost and rentals data.

Table 1
variable N Mean SD Min Median Max
cost 96 56434.292 11589.2780 -985 55160 95011
costAdj 96 60783.135 12833.3642 -1066 61069 99901
rentals 96 2316.219 453.1665 1362 2277 3628

From the above table, note that the values for inflation-adjusted costs (costAdj) are higher than those of the actual costs. Mean monthly inflation-adjusted cost is $60,783 with a standard deviation of $12,833. It is worth nothing that at least one observation has a negative cost; since this is not economically possible, being the result of some accounting irregularity, some data transformation will be needed. Mean monthly room rentals is 2,316 with a standard deviation of 453.

In the case I mentioned that there were two issues with a long time series of data, the first of these being inflation. The second issue that must be considered is whether some sort of change occurred during the year that would impact the cost function (e.g., a change in technology). In this business, there has been no changes.

Visualizing the Relationship Between Rentals and Hotel Expenses

It is almost always helpful to plot the relationships between variables. In this setting, there is a single cost driver that management can influence: rooms rented. Accordingly, plot total expenses and rooms rented, and comment on the resulting graph.

Figure 1

In the plot above, inflation-adjusted costs are generally increasing in room rentals. Overall, this plot does confirm that monthly spending is related to monthly rentals.

However, there is a fairly large dispersion of the data points (dots) and several dots appear to be far removed from the overall grouping. Further, on the right side of the plot (above 3,200 rentals), 6 points appear to be lower than the trend suggested by the other data. The plot reveals that just one observation has a negative value.

With data points dispersed, rather than closely grouped on an imaginary line, I anticipate r-square – the proportion of variance in spending explained by room rentals – will be relatively low. This is normal — only textbook problems have high r-square values.

I count three unusual observations. Later, statistical analysis can confirm whether these values are extreme. There are numerous possible reasons for unusual values. For example, the hotel could have had a large repair expense in a month where accounting decided to expense the amount rather than capitalize it (as you will learn later, this means recording the amount as an asset, and recording depreciation expense in future years). An example of an unusual low expense could be due to the hotel being closed for a period of time for renovations. Another source of unusual observations can occur as a result of accounting errors in recording expenses (entering a wrong month, charging expenses to the wrong hotel, etc.). Another instance that might arise can come from a time period mismatch. For example, guests occupying the hotel use water and electricity. Assume utilities record consumption at the end of a month. These items are billed at the beginning of the next month and the company pays the bills 30 days later – fully two months after the room rentals were recorded. If rentals are correlated across months, the error may not be significant, but it will have an effect if the correlation, r is less than 1.

An Initial Model

As a second step, regress inflation-adjusted costs onto rentals. Remember, inflation-adjusted costs must be used or the costs will not be comparable. This means that, once inflation-adjusted cost is created, actual cost is no longer used. The regression results are provided below.

Table 2
Base Regression Model

Call:
lm(formula = costAdj ~ rentals, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-61537  -4452    739   5627  28940 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28212.105   5982.651   4.716 8.35e-06 ***
rentals        14.062      2.535   5.546 2.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11200 on 94 degrees of freedom
Multiple R-squared:  0.2466,    Adjusted R-squared:  0.2386 
F-statistic: 30.76 on 1 and 94 DF,  p-value: 2.669e-07

With an F-statistic of 30.76 (df = 1, 94, p < .05), the model is statistically significant. R-square, as I predicted from Figure 1, is only 0.25. The monthly cost equation is:

Cost = $28,212 + $14.06 * Rentals

With p-values on both coefficients less than 0.05, the coefficients are statistically significant. The model implies monthly fixed cost is $28,212 and each additional room rental increases cost by $14.06. This is the hotel’s unit variable cost and, in a simple linear model, it is also the firm’s marginal cost. (Note that the intercept is not really the fixed cost since we do not have a relevant range that includes zero room rentals.)

Assessing the Model

The model is economically plausible as expenses should rise (fall) with increases (decreases) in rentals. Goodness of fit is, however, not very good. Only a small proportion of the variance in spending can be attributed to the cost driver. Further investigation is warranted. However, absent correlated omitted variables (you will see later this is false), the coefficient on rentals would be an unbiased estimate of unit variable cost.

Regression analysis has several key assumptions. Among these are that errors (residuals, which are the differences between actual costs and the values estimated by the model) are normally distributed, that the variance of the residuals is constant across the entire range of the data, and that errors are not autocorrelated (i.e., an error in a prior period is not correlated with the current period error). If these assumptions are violated, the results may not be valid. Fortunately, there are steps to correct for violations. We also assume the model is theoretically correct, that independent variables are measured without error, and so on.

Normality of the Residuals

In evaluating these assumptions there are both visual and statistical tests. This document reports both. The first of these are plots of the error terms, consisting of a histogram and a normal probability plot. (These plots are related and, over time, one can assess normality using just one plot).

In the Q-Q plot (Figure 2), points should fall on a diagonal line. Since the model is not perfect, deviations are expected. But observations relatively far from the line and isolated are unusual and cause the residuals to be non-normally distributed. As mentioned above, I suspected three observations were unusual, and those points are evident in the q-q plot (two points are in the lower left below line and one is in the upper right above the diagonal).

A histogram (Figure 3) shows the distribution of the residuals, a plot that should be more familiar to most. A normally-distributed variable appears as a bell-shaped, symmetrical distribution. Figure 3 reveals that residuals have a left skew and is not normally-distributed.

With plots, observers may differ in their interpretations and conclusions. Formal statistical tests have been created to assess whether a variable has a normal distribution. One such test is the Shapiro-Wilk test of normality, the results of which are shown below:


    Shapiro-Wilk normality test
W = 0.8647, p-value = 6.958e-08

The null hypothesis for this test assumes normality of the residuals. With a p-value less than 0.05, the null hypothesis of normality is rejected, and one must conclude that the residuals are not normally distributed. A key assumption of regression has been violated.

Homoskedasticity (Constant Variance)

A common way to evaluate the assumption of constant variance is to plot residuals vs. fitted (predicted) values. (This amounts to simply “tilting” Figure 1 so that the regression line becomes the horizontal axis.) See Figure 4 below.

With constant variance, a uniform band of dots will exist across the horizontal range of the plot. For the most part, this is what we see. Again, three points (2 below the horizontal axis, and one above it) may be outliers. There is also some indication that high values of actual cost are more likely to be below the axis (i.e., the predicted values of spending). Based on hundreds of regressions, I would conclude that the residuals are homoskedastic (i.e., of constant variance). The Breusch-Pagan (B-P) test provides an objective assessment of homoskedasticity:


    Studentized Breusch-Pagan test
BP = 0.41585, df = 1, p-value = 0.519

The null hypothesis of the B-P test is that errors are homoskedastic. With a test statistic B-P = 0.416 (df = 1, p = 0.519), we do reject the null hypothesis of homoskedasticity. The visual assessment is confirmed by this formal test.

Autocorrelated Errors

As with the other tests, we examine residuals visually (Figure 5) and with a statistical test.

 lag Autocorrelation D-W Statistic p-value
   1       0.3833909      1.202647       0
 Alternative hypothesis: rho != 0

Examining the residuals, it appears that positive errors are often followed by positive errors, and negative errors are often followed by negative errors. This suggests that errors are correlated across time. The Durbin-Watson statistic (1952) is a formal test of serially correlated errors, and the value of D-W is 2 when errors are uncorrelated. Values of D-W far from 2 suggest correlated errors. The null hypothesis is that errors are not serially correlated, and D-W = 1.203 (p < .05) strongly suggests correlated error. Again, this agrees with the visual assessment.

With residuals being non-normal and auto-correlated, the model must be re-specified.

Cost Behavior Over Time

The auto-correlated and non-normal errors suggested to me that important factors have been omitted from the model. Since these data have been collected over many years, I decided to examine the pattern of spending each year. Multiple boxplots are a good way to examine whether the distribution of, and level of, spending varies over time. See Figure 6.

The box plot, as you should recall from statistics (or an Internet search or AI), displays the distribution of a variable by depicting a box with “whiskers”. The lower horizontal line of the box represents the first quartile of the data, the upper line of the box represents the 3 quartile, and the thick line represents the median value. A normally distributed variable is symmetric, so its box should have the same distance from the median to each of the quartiles. The whiskers depicts the expected values of the values above and below the quartiles. Outliers are shown as open dots.

What is evident in the figure is that the spending in every given year is skewed (although some years have a positive skew and others a negative skew) and the median values in spending are shifting across time. This sort of behavior, and the extent of the shifts, is not what I would expect from a random process. It appears that these shifts are being caused by management, not just the random fluctuations due to sales demand, inflation, etc.

Recall from your knowledge of fixed costs, there are two types of discretionary fixed costs: committed and discretionary. Further recall that fixed costs are those costs that do not vary with respect to sales volumes (i.e., output). They may vary — in fact, they likely do so — but they do not vary with respect to output. Recall a variable cost is one that varies proportionally with respect to output.

These box plots suggest to me that there are annual changes in discretionary spending and failure to account for the changing fixed cost will result in a significant bias in estimating the regression model’s coefficients.

At this point, you may wonder: “Do you have an example that illustrate this behavior?” Yes, there are several. For example, let’s assume a manager needs to schedule maintenance and repairs. Some repairs must be done immediately for safety (e.g., loose stairs) or simple necessity (e.g., broken air conditioners in summer), whereas others may be deferred (e.g., painting the outside stairs or resurfacing the parking lot). Some expenditures “fall in the middle” (e.g., choosing not to clear a parking lot after a modest snowfall). What happens is that, in good years (i.e., exceeding the budget), the maintenance is done as needed. In a particularly good year, some projects might be undertaken early. In bad years, however, the company may not want to fall short of earnings forecasts to investors and senior management will ask other managers to undertake efforts to reduce spending. In this case, non-safety repairs are likely to be put off into the future. This is a form of earnings management since the firm does not want to surprise its investors or the market.

A Complete Model

Having worked on sustainability for the firm, I knew that energy expense was the second largest expense after labor. It is almost a certainty that temperature will be a factor in driving hotel costs. Accordingly, I obtained daily temperature data from the National Oceanic and Atmospheric Administration (NOAA). Their data records the daily temperature in terms of heating degree days (HDD) and cooling degree days (CDD). The reference temperature, as best I recall, is 65 degrees and on a day when the maximum temperature is, say, 85 degrees, NOAA will register 85 - 65 = 20 cooling degree days. On a cold day, when the temperature is just 40 degrees, NOAA will record 25 heating degree days. (This may not be entirely accurate, but is adequate for our use as a proxy for energy intensity that should be closely related to energy costs.) With financial data on a monthly basis, simply sum the daily measurements of CDD and HDD to obtain monthly figures. Since the energy use may differ between heating and cooling, we add both HDD and CDD to the regression model.

In addition to temperature, one must account for the annual shifts in fixed discretionary spending. To do this, add dummy variables for each year and monthly temperature data. Accordingly to the CEO, energy expense was the second-largest expense after labor. And it is hard to control as it is related to external events (i.e., the weather). And a brief study, or just thinking of your travel and that of friends, is that travel is probably correlated with seasons and, therefore, with weather.

As argued above, shifts in discretionary fixed expense will impact total spending. Weather should have an impact on cost and, if correlated with customer demand (i.e., rentals), leaving weather out of the cost estimation model would bias regression coefficients that are correlated with weather (as rentals are). The variable in the original model will is capturing the effects of the omitted variable(s) on total cost.

After creating dummy variables, and monthly weather variables (cdd and hdd), I obtained the following regression results:

Table 3
Full Model

Call:
lm(formula = costAdj ~ rentals + hdd + cdd + year.F, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-58602  -3460    245   4822  29648 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  47475.335   7444.658   6.377    9e-09 ***
rentals          3.642      3.104   1.173 0.243940    
hdd              2.672      8.231   0.325 0.746278    
cdd             29.315      8.384   3.496 0.000752 ***
year.F2016     515.404   4033.246   0.128 0.898618    
year.F2017    4237.299   4031.590   1.051 0.296226    
year.F2018     227.712   4051.306   0.056 0.955309    
year.F2019   -1780.737   4051.907  -0.439 0.661428    
year.F2020    -231.841   4038.493  -0.057 0.954355    
year.F2021  -11141.329   4057.669  -2.746 0.007367 ** 
year.F2022   -8814.619   4096.899  -2.152 0.034271 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9872 on 85 degrees of freedom
Multiple R-squared:  0.4706,    Adjusted R-squared:  0.4083 
F-statistic: 7.556 on 10 and 85 DF,  p-value: 1.696e-08

The regression model results are very different. Before commenting on this, and because the last model was not satisfactory, I begin with an assessment of the model’s assumptions: normality of residuals, homoskedasticity and non-autocorrelated errors.

Normality and Serial Autocorrelation of the Residuals

Figure 7 shows the normal probability plot of the residuals and, below the plot, the Shapiro-Wilk test of normality and Durbin-Watson test of serial autocorrelation.


    Shapiro-Wilk normality test
W = 0.79013, p-value = 0.002
 lag Autocorrelation D-W Statistic p-value
   1       0.2492142      1.489103   0.002
 Alternative hypothesis: rho != 0

The residuals are yet again not normal (W = 0.79, p < .05) and are serially autocorrelated (DW = 1.489, p < .05). The r-square is greatly improved over the initial model, although the model still explains less than half the variance in the hotel’s total cost. The regression line slope is also more in line with my expectations, but this time the estimate is surprisingly low and not significant at the conventional threshold (p = 0.12, one-tailed). The model is also economically plausible.

The temperature measures (hdd and cdd) must be included in the model because they are correlated with rentals and, without them, the coefficient estimate on rentals would be biased as it would capture the effect of rentals on cost, plus the impact of hdd and cdd through the correlations these variables have with rentals.

As mentioned above, dummy variables capture the impact of the change in discretionary fixed expenses, each year. With no dummy variable for the year 2015, this is the model’s baseline. Variations in fixed discretionary spending from 2016 - 2022, appear as the coefficients of the dummy variables year.F2016 - year.F2022. Non-significant coefficients of year.F2016 - year.F2020 indicate that the discretionary fixed expenses in years 2016 through 2022 are not significantly different from the costs incurred in 2015, despite the apparent rise and fall in years 2015-2020, as shown in Figure 6. Variations can be considered random, but not different. However, significant negative coefficients (p < .05) for year.F2021 and year.F2022 mean discretionary fixed expenses were significantly lower in those years. It is important that these variables be in the model, although one could consider removing the dummy variables for the first years of the time series. I choose to keep them in the model going forward.

Removing Outliers

At this point, with no other possible variables to add, I evaluated and removed observations with a standard deviation above 3 and below -3. I then re-estimated the regression model, with the following results:

Table 4
Two Outliers Removed

Call:
lm(formula = costAdj ~ rentals + hdd + cdd + year.F, data = dfReduced)

Residuals:
     Min       1Q   Median       3Q      Max 
-24067.1  -3111.6    308.2   3547.0  17197.0 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  46747.959   4855.552   9.628 3.63e-15 ***
rentals          3.592      2.021   1.777  0.07929 .  
hdd              9.135      5.381   1.698  0.09334 .  
cdd             28.457      5.440   5.231 1.24e-06 ***
year.F2016     543.651   2615.761   0.208  0.83587    
year.F2017    4306.479   2614.674   1.647  0.10333    
year.F2018    5993.473   2690.543   2.228  0.02861 *  
year.F2019   -4573.511   2683.719  -1.704  0.09209 .  
year.F2020    -166.300   2619.207  -0.063  0.94953    
year.F2021  -11109.137   2631.723  -4.221 6.18e-05 ***
year.F2022   -8925.723   2657.246  -3.359  0.00118 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6402 on 83 degrees of freedom
Multiple R-squared:  0.6692,    Adjusted R-squared:  0.6294 
F-statistic: 16.79 on 10 and 83 DF,  p-value: 3.764e-16

A look at the results shows much better results. With 2 observations removed, r-squared increased to a respectable 0.67, and the important variables of interest are all significant (one-tail tests, p < .05). Before discussing the results, we assess the model.

Assessing the Model

The model is economically plausible as expenses have the expected relationships (+) with the continuous independent variables (rentals, hdd and cdd). Goodness of fit, as just mentioned, is good. About two-thirds of total spending can be explained by this model. I now turn to the statistical tests of normality, homoskedasticity and autocorrelated errors.


    Shapiro-Wilk normality test
W = 0.96421, p-value = 0.01126


    Studentized Breusch-Pagan test
BP = 19.377, df = 10, p-value = 0.03573
 lag Autocorrelation D-W Statistic p-value
   1      0.07205388      1.808166    0.06
 Alternative hypothesis: rho != 0

These tests indicate that the residuals are no longer serially autocorrelated but, with the removal of the two outliers, the third observation I thought was a bit unusual is, in fact, an outlier in this new model (with a standardized residual < -3). I would remove observation 50 and re-do the analysis. Using additional criteria for evaluating influential observations (e.g., Cook's distance [d] and hat values [h]) would have identified three influential data points in the last step rather than the 3 identified after the last regression.

Removing Another Outlier

Removing the additional observation with the large negative residual would be a good thing to do.


Table 5
Final Model
Call: lm(formula = costAdj ~ rentals + hdd + cdd + year.F, data = dfReduced) Residuals: Min 1Q Median 3Q Max -18325.6 -3101.5 -15.1 3208.1 14748.3 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 44770.890 4400.868 10.173 3.40e-16 *** rentals 5.039 1.851 2.722 0.007929 ** hdd 6.483 4.889 1.326 0.188437 cdd 24.023 5.005 4.800 7.03e-06 *** year.F2016 564.569 2358.866 0.239 0.811440 year.F2017 4267.611 2357.897 1.810 0.073972 . year.F2018 5772.979 2426.798 2.379 0.019691 * year.F2019 -2252.342 2475.003 -0.910 0.365471 year.F2020 -91.710 2362.027 -0.039 0.969123 year.F2021 -10896.843 2373.728 -4.591 1.58e-05 *** year.F2022 -8532.831 2397.877 -3.558 0.000623 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 5773 on 82 degrees of freedom Multiple R-squared: 0.709, Adjusted R-squared: 0.6735 F-statistic: 19.98 on 10 and 82 DF, p-value: < 2.2e-16


    Shapiro-Wilk normality test
W = 0.99134, p-value = 0.8092


    Studentized Breusch-Pagan test
BP = 15.353, df = 10, p-value = 0.1197
 lag Autocorrelation D-W Statistic p-value
   1      0.04978568       1.84267    0.09
 Alternative hypothesis: rho != 0

The plots and statistical tests above do not reject the null hypotheses of normality of residuals, homoskedasticity of residuals, or independence of residuals. Further, r-square is quite good (r2 = 0.67). The model is satisfactory.

Implausible Coefficients?

None of the regression coefficients are implausible, so no new models are required.

Conclusion

The results of the three models are shown in Table 6 below. Since I am not interested in the discretionary fixed expenses or the impact of temperature on costs, these coefficients have been omitted from the following table:

Table 6
Comparison of Regression Models

                            Basic                  Full              Outliers Removed
rentals                   14.062***                3.642                  5.039**         
                           (2.535)                (3.104)                 (1.851)        
                                                                                         
Year Dummies                 Yes                    Yes                     Yes

Temperature Controls         Yes                    Yes                     Yes

Constant                28,212.100***          47,475.330***           46,747.960***     
                         (5,982.651)            (7,444.658)             (4,855.552)      
-----------------------------------------------------------------------------------------
N                             96                     96                     94           
R2                          0.247                  0.471                   0.669         
Adjusted R2                 0.239                  0.408                   0.629         
Res. Std. Error      11,198.510 (df = 94)   9,871.510 (df = 85)     6,402.090 (df = 83)  
F Statistic         30.763*** (df = 1; 94) 7.556*** (df = 10; 85) 16.794*** (df = 10; 83)
-----------------------------------------------------------------------------------------
Notes: *p<0.1; **p<0.05; ***p<0.01. Standard errors in parentheses.

Each of the models was previously discussed, so no discussion is needed in terms of economic plausibility, goodness of fit or the magnitude of the regression line slope. Here I briefly compare across the models.

Development of these models, from a simple regression to a full regression model with both continuous and categorical (dummy) variables is interesting. This approach should demonstrate the importance of getting the model specification correct, having a theory to build the model, and the importance of testing a model’s assumptions. When violations occur, it is important to review the data and review the model, and make adjustments. Removing two outliers substantially improved the results in terms of coefficient significance levels and, therefore, the r-square (i.e., goodness of fit).

The work I did to produce this report was done in R. You can replicate this work in Excel, but it is tedious, time-consuming, and error-prone. Further, it is not reproducible research. Again, with this solution key I want to give you a complete picture of the analysis. The many details here were not used in my assessment of the work submitted.