Medical Expenses OLS Analysis

Motivation

Medical expenses at the individual level vary enormously. Some people spend nothing in a given year; others spend tens or hundreds of thousands of dollars. For a semester-long project in my econometrics course, I set out to study which observable individual characteristics predict total annual medical expenses, and how much of the variation those characteristics can actually explain.

The initial hypothesis was that higher annual income would be associated with higher medical expenses, on the reasoning that higher-income individuals can afford more doctor visits, more specialists, and more out-of-pocket care. As it turned out, the data disagreed. This article walks through the full analysis: data cleaning, a simple linear regression, a multiple linear regression with an interaction term, a baseline specification, and diagnostic tests for functional form misspecification (RESET) and heteroskedasticity (Breusch-Pagan).

Data

The data come from the IPUMS Health Surveys collection, a harmonized set of U.S. health microdata. The working dataset contains 2,016 observations and 15 variables per individual. Categorical variables include gender, marital status, race, education, and several insurance indicators (has a usual place of care, has health insurance, has private insurance, Medicaid, Medicare). Numeric variables include age, annual income, total medical expenses, number of doctor visits, number of ER visits, and nights spent in the hospital.

During cleaning, I dropped the ID column and removed 150 individuals with zero reported annual income, representing 6.9% of the original 2,166 observations. These are largely students and retirees without reportable earnings, and including them would anchor the income distribution at zero in a way that distorts the income-based comparisons I wanted to study.
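The cleaning step can be sketched in a few lines. The analysis itself was run in R, so the Python below is only an illustration, and the record layout and field names (`id`, `income`, `med_exp`) are hypothetical stand-ins rather than the actual IPUMS variable names.

```python
# Toy records standing in for the raw extract; field names are hypothetical.
raw = [
    {"id": 1, "income": 0, "med_exp": 310},
    {"id": 2, "income": 26000, "med_exp": 1500},
    {"id": 3, "income": 54000, "med_exp": 0},
    {"id": 4, "income": 0, "med_exp": 9200},
]

def clean(records):
    """Drop the ID field and remove rows with zero reported income."""
    cleaned = []
    for row in records:
        if row["income"] == 0:
            continue  # zero-income rows are excluded from the analysis sample
        cleaned.append({k: v for k, v in row.items() if k != "id"})
    return cleaned

sample = clean(raw)
print(len(sample), "rows kept")

# Share of the original sample removed, using the counts reported in the text.
share_dropped = 150 / 2166
print(f"{share_dropped:.1%} dropped")
```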

The cleaned sample contains 933 males and 1,083 females. About 41.8% are currently married and 58.2% are not. There are no minors; the sample splits into 1,562 working-age adults (18 to 64) and 454 seniors (65+). The racial composition is 68.2% White, 21.3% Black, 7.3% Asian, 2.5% multiracial, and 0.7% Native American. Education ranges from less than high school (18.5%) to graduate degree (16.7%), with high school graduates (27.2%), some college (21.0%), and bachelor’s degrees (16.6%) filling the rest.

Summary statistics for the numerical variables are shown below.

Variable	Min	Max	Median	Mean	SD
Age	18	99	51.0	50.68	16.73
Annual Income ($)	8	260,182	26,913.5	38,632.70	37,565.19
Total Medical Expenses ($)	0	423,121	1,469.0	6,924.11	19,116.14
Number of Doctor Visits	0	195	2.0	4.18	7.42
Number of ER Visits	0	8	0.0	0.24	0.68
Number of Nights in Hospital	0	44	0.0	0.38	2.16

Two features stand out immediately. First, annual income and total medical expenses both have means far above their medians, a signature of heavy right-skew. Second, the standard deviation of total medical expenses ($19,116) is nearly three times its mean, indicating extreme dispersion driven by a small number of high-cost individuals.
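Both observations can be checked directly from the table. The short Python sketch below (illustrative only; the analysis itself was run in R) computes the mean-to-median ratio and the coefficient of variation (SD divided by mean) from the reported summary statistics.

```python
# Summary statistics copied from the table above: (median, mean, sd).
stats = {
    "annual_income": (26913.5, 38632.70, 37565.19),
    "total_medical_expenses": (1469.0, 6924.11, 19116.14),
}

def skew_indicators(median, mean, sd):
    """Return (mean/median ratio, coefficient of variation = sd/mean)."""
    return mean / median, sd / mean

results = {name: skew_indicators(*vals) for name, vals in stats.items()}
for name, (ratio, cv) in results.items():
    print(f"{name}: mean/median = {ratio:.2f}, sd/mean = {cv:.2f}")
```

A mean-to-median ratio well above 1 points to right-skew, and an SD-to-mean ratio well above 1 points to dispersion driven by a few very large values.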

The histograms below confirm this.

Figure 1. Distribution of annual income (histogram). Mode near $10K, long right tail.

Figure 2. Distribution of total medical expenses (histogram). Nearly all mass below $20K, with a long right tail of outliers.

Both distributions are compatible with the usual stylized facts of income and healthcare spending: most people cluster at modest values, with a long right tail of outliers.

Simple Linear Regression: Income Only

The first model regresses total medical expenses on annual income alone. The Pearson correlation between the two variables is -0.07: weak and, counter to the hypothesis, negative. With a correlation this small the slope is hard to see in the raw scatterplot, but the plot does show where the sign comes from: the largest expenses are concentrated among low-income individuals.

Figure 3. Total medical expenses vs. annual income (scatterplot). Highest expenses concentrate at low incomes.

Estimating the simple linear regression

MedExp_i = β₀ + β₁ Income_i + ε_i

by OLS gives the following results.

Variable	Coefficient	Std. Error	t value	p value
Intercept	~7,830	—	—	—
Annual Income	-0.0234	0.00786	-2.979	0.003

F-statistic: 8.88 (1, 2014 df), p-value = 0.003. R² = 0.0044. Residual standard error ≈ $19,080.

The coefficient on income is -0.0234, meaning that a $1 increase in annual income is associated with about a $0.023 decrease in total medical expenses, or roughly a $23 decrease per $1,000 of additional income. The coefficient is statistically significant at the 1% level (p = 0.003). With a single regressor the F-test is the same test as the t-test on the slope (F = t²), so it rejects the null of no linear relationship at the same p-value.
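The reported numbers hang together in ways that are easy to verify. In a simple regression the t statistic is the coefficient over its standard error, F equals t squared, the absolute Pearson correlation equals the square root of R², and the fitted line passes through the sample means. The Python sketch below (illustrative only; the estimation itself was done in R) checks these identities using only figures from the text and tables, so small differences are rounding.

```python
import math

# Figures reported in the text and tables above.
coef, se = -0.0234, 0.00786
r_squared = 0.0044
mean_income, mean_medexp = 38632.70, 6924.11

t_stat = coef / se                  # t statistic for the slope
f_from_t = t_stat ** 2              # with one regressor, F equals t squared
abs_r = math.sqrt(r_squared)        # |Pearson r| equals sqrt(R^2) here
intercept = mean_medexp - coef * mean_income  # OLS line passes through the means

print(f"t = {t_stat:.3f}, t^2 = {f_from_t:.2f}")
print(f"|r| = {abs_r:.3f}")
print(f"implied intercept = {intercept:,.0f}")
```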

Figure 4. Simple linear regression fit overlaid on the scatterplot. The slope is slightly negative.

The effect is statistically real but economically tiny. The R² of 0.0044 means income explains less than half of one percent of the variance in medical expenses. The residual standard error of about $19,080 dwarfs the median spend of $1,469. As a standalone predictor, income is close to useless.

The negative sign also flips the hypothesis on its head. A plausible explanation is that income is picking up unobserved health status: in this sample, lower-income individuals may be sicker on average and therefore consume more care. Without controls, income is partly proxying for health.
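To see how an omitted health variable could flip the sign, here is a small simulation in Python (illustrative only; the analysis itself was run in R). Everything in it is synthetic and assumed: an unobserved `illness` score lowers income and raises spending, while income is given a small positive direct effect. It does not show that confounding explains the result in the IPUMS sample, only that the mechanism can produce a negative slope.

```python
import random

random.seed(0)

def simple_ols(x, y):
    """Closed-form OLS for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Synthetic data, not the IPUMS sample. "illness" is an unobserved severity
# score that lowers income and raises spending; income itself has a small
# POSITIVE direct effect on spending (true_income_effect).
true_income_effect = 0.01
income, med_exp = [], []
for _ in range(2000):
    illness = random.gauss(0, 1)
    inc = 40000 - 10000 * illness + random.gauss(0, 5000)
    spend = 5000 + true_income_effect * inc + 8000 * illness + random.gauss(0, 2000)
    income.append(inc)
    med_exp.append(spend)

b0, b1 = simple_ols(income, med_exp)
print(f"true direct effect: {true_income_effect:+.3f}")
print(f"slope when illness is omitted: {b1:+.3f}")
```

Regressing spending on income alone attributes the illness effect to income, so the estimated slope comes out negative even though the direct effect built into the data is positive.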

Multiple Linear Regression with Interaction

To recover a more credible picture, I estimate a multiple linear regression that adds demographic and utilization controls, plus an interaction between income and doctor visits. The specification is

MedExp_i = β₀ + β₁ Income_i + β₂ DocVisits_i + β₃ (Income_i × DocVisits_i) + β₄ Age_i + β₅ Married_i + β₆ Education_i + β₇ Gender_i + β₈ Insurance_i + β₉ Race_i + ε_i

The motivation for the interaction is that income and healthcare utilization may jointly amplify spending: a higher-income individual who also sees a doctor frequently could spend disproportionately more than either variable alone would predict. A positive coefficient on the interaction would support that hypothesis. Excluded reference categories are female (gender), Asian (race), not married (marital status), no health insurance, and bachelor’s degree (education).

Estimation by OLS on 2,016 observations yields:

Variable	Coefficient	Std. Error	t value	p value
Intercept	-2,699	2,409	-1.108	0.268
Annual Income	-0.0294	0.0137	-2.144	0.032
Doctor Visits	914.1	66.02	13.845	< 2e-16
Income × Doctor Visits	0.00104	0.0015	0.675	0.500
Age	93.92	25.00	3.757	0.0002
Married (Yes)	-1,564	825.8	-1.894	0.058
Education: Graduate Degree	477.0	1,359	0.351	0.726
Education: High School	420.8	1,264	0.333	0.739
Education: Less than High School	2,392	1,399	1.710	0.087
Education: Some College	1,174	1,315	0.893	0.372
Gender (Male)	389.9	810.9	0.481	0.631
Health Insurance (Yes)	2,639	1,360	1.941	0.052
Race: Black	-1,718	1,719	-0.999	0.318
Race: Multiracial	-1,593	2,865	-0.556	0.578
Race: Native American	713.1	4,916	0.145	0.885
Race: White	-589.6	1,543	-0.382	0.702

F-statistic: 27.39, p-value < 0.001. R² = 0.1704. Adjusted R² = 0.1642. Residual SE ≈ $17,480.
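The adjusted R² and the overall F-statistic both follow from R², the sample size, and the number of slope coefficients (15 rows in the table, excluding the intercept). A quick Python check of that arithmetic (illustrative only; the estimation itself was done in R):

```python
n, k = 2016, 15          # observations and slope coefficients in the table above
r2 = 0.1704              # reported R-squared

df_resid = n - k - 1
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print(f"adjusted R^2 = {adj_r2:.4f}")
print(f"F({k}, {df_resid}) = {f_stat:.2f}")
```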

A few things jump out.

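Doctor visits dominate. The coefficient is $914.1 per additional visit (p < 2e-16); strictly, that is the effect at zero income, because the marginal effect of a visit in this model is the doctor-visits coefficient plus the interaction coefficient times income. Evaluated at Income = $30,000, one additional doctor visit is associated with about $945.27 more in expenses; at Income = $60,000, about $976.44. The effect is large and precisely estimated regardless of where we evaluate it.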

Age is a strong second. Each additional year of age is associated with about $93.92 more in total medical expenses (p = 0.0002), holding all else constant. Across the sample’s age range (18 to 99), that is an economically meaningful gradient.

Income remains statistically significant but economically small, and still negative. With controls, the income coefficient of -0.0294 is the effect of a $1 increase in annual income at zero doctor visits; the interaction shifts it by 0.00104 per visit. Evaluated at 5 doctor visits, a $1,000 increase in income implies roughly a $24.20 decrease in medical expenses; at 10 doctor visits, roughly $19.00.
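With an interaction in the model, each marginal effect depends on the level of the other variable. The Python sketch below (illustrative only; the estimation itself was done in R) evaluates both marginal effects from the rounded coefficients in the table. Because the table rounds the interaction coefficient, the visit effects differ from the figures quoted in the text by a few cents.

```python
# Rounded coefficients from the interaction model above.
B_INCOME = -0.0294        # income main effect
B_VISITS = 914.1          # doctor-visits main effect
B_INTERACT = 0.00104      # income x doctor-visits

def effect_of_visit(income):
    """Change in expected expenses from one more doctor visit at a given income."""
    return B_VISITS + B_INTERACT * income

def effect_of_income(visits, step=1000):
    """Change in expected expenses from `step` more dollars of income."""
    return (B_INCOME + B_INTERACT * visits) * step

for inc in (30_000, 60_000):
    print(f"one more visit at income {inc:,}: {effect_of_visit(inc):,.2f}")
for v in (0, 5, 10):
    print(f"$1,000 more income at {v} visits: {effect_of_income(v):,.2f}")
```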

The interaction does not carry its weight. The coefficient on Income × DocVisits is 0.00104 with p = 0.500. The direction hints at the income-doctor-visits amplification story (a higher number of visits slightly weakens the negative income effect), but the confidence interval easily contains zero.
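For a sense of how comfortably the interval covers zero, here is an approximate 95% confidence interval computed from the rounded estimate and standard error in the table. It is a Python sketch using the normal critical value of 1.96, which is a reasonable approximation with roughly 2,000 residual degrees of freedom.

```python
coef, se = 0.00104, 0.0015   # interaction estimate and standard error (rounded)
z = 1.96                     # normal approximation to the t critical value

lower, upper = coef - z * se, coef + z * se
print(f"approx. 95% CI: [{lower:.5f}, {upper:.5f}]")
print("contains zero:", lower < 0 < upper)
```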

Marginal significance cluster. Married (p = 0.058), less-than-high-school (p = 0.087), and has-health-insurance (p = 0.052) all fall between 5% and 10%. Married individuals spend about $1,564 less than unmarried individuals on average; those with less than a high school education spend about $2,392 more than bachelor’s-degree holders; individuals with health insurance spend about $2,639 more than those without (consistent with insured individuals actually accessing care rather than forgoing it).

Gender and race are not significant. Once we control for income, doctor visits, age, marital status, education, and insurance, neither the male coefficient nor any of the race coefficients clears a conventional significance threshold.

The joint F-test strongly rejects the null that all slope coefficients are zero (F = 27.39, p < 0.001). But the R² of 0.17 means 83% of the variance in medical expenses is still unexplained. That is a big gap.

Baseline Specification (Interaction Dropped)

Because the interaction term does not approach significance, I re-estimate without it and treat this as the baseline specification for further diagnostics:

MedExp_i = β₀ + β₁ Income_i + β₂ DocVisits_i + β₃ Age_i + β₄ Married_i + β₅ Education_i + β₆ Gender_i + β₇ Insurance_i + β₈ Race_i + ε_i

OLS results:

Variable	Coefficient	Std. Error	t value	p value
Intercept	-2,887	2,387	-1.209	0.227
Annual Income	-0.0244	0.0115	-2.115	0.035
Doctor Visits	938.6	55.16	17.015	< 0.001
Age	95.37	24.90	3.830	< 0.001
Married (Yes)	-1,573	825.6	-1.905	0.057
Education: Graduate Degree	492.7	1,355	0.363	0.717
Education: High School	422.2	1,263	0.334	0.738
Education: Less than High School	2,398	1,399	1.714	0.087
Education: Some College	1,179	1,315	0.897	0.370
Gender (Male)	352.7	808.9	0.436	0.663
Health Insurance (Yes)	2,603	1,358	1.916	0.056
Race: Black	-1,649	1,716	-0.961	0.337
Race: Multiracial	-1,549	2,864	-0.541	0.589
Race: Native American	764.1	4,915	0.155	0.877
Race: White	-529.1	1,540	-0.344	0.731

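The coefficients on doctor visits and age are almost unchanged. The income coefficient moves from -0.0294 to -0.0244, but the two are not directly comparable: the first is the income effect at zero doctor visits, while at the sample mean of 4.18 visits the interaction model implies -0.0294 + 0.00104 × 4.18 ≈ -0.0251, close to the baseline estimate. Doctor visits actually tighten up to a t-value of 17.015 without the collinear interaction term. The baseline retains essentially all the information from the fuller model without the distraction of a non-significant interaction.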

Functional Form: RESET Test

Even with a richer control set, OLS assumes the conditional expectation is linear in the predictors. To test this, I run Ramsey’s RESET test by adding powers of the fitted values (ŷ², ŷ³, ŷ⁴) as additional regressors and jointly testing whether their coefficients are zero.

Variable	Coefficient	Std. Error	t value	p value
... (baseline regressors)	...	...	...	...
ŷ²	1.638e-5	1.782e-5	0.919	0.358
ŷ³	-3.870e-10	3.310e-10	-1.169	0.243
ŷ⁴	1.545e-15	1.314e-15	1.176	0.240

Individually, none of the higher-order terms are significant, which is a textbook signature of multicollinearity among powers of the same fitted value. The joint exclusion-restriction F-test, however, is

F = 6.399 (3 df numerator), p-value < 0.001.

That rejects the null that the three higher-order coefficients are simultaneously zero at conventional significance levels. The reading: there is real curvature in the relationship between medical expenses and the predictors that the linear specification is missing. Likely candidates for non-linearity include age (plausibly concave or convex depending on life-cycle effects), doctor visits (diminishing or accelerating returns), and income (log transformations are standard in this literature). A non-linear specification is a natural next step.
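The mechanics of the test can be sketched in plain Python. This is an illustration on synthetic data, not the R code used for the analysis: it fits the restricted model, adds powers 2 to 4 of the fitted values, and forms the exclusion F-statistic from the two sums of squared residuals. The helper names (`ols_ssr`, `reset_f`) are mine, and the fitted values are standardized before taking powers purely for numerical stability; with an intercept in the model this leaves the F-statistic unchanged.

```python
import random

def ols_ssr(X, y):
    """Fit OLS via the normal equations; return (fitted values, sum of squared residuals)."""
    p = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for j in range(c, p + 1):
                A[r][j] -= f * A[c][j]
    # Back-substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (A[i][p] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return fitted, ssr

def reset_f(X, y, powers=(2, 3, 4)):
    """Ramsey RESET F statistic: jointly test powers of the fitted values."""
    n, k = len(X), len(X[0])          # k counts the intercept column
    fitted, ssr_r = ols_ssr(X, y)
    # Standardize fitted values before taking powers (same column space as raw
    # powers when the model has an intercept, but better conditioned).
    m = sum(fitted) / n
    s = (sum((f - m) ** 2 for f in fitted) / n) ** 0.5
    X_u = [row + [((f - m) / s) ** pw for pw in powers] for row, f in zip(X, fitted)]
    _, ssr_u = ols_ssr(X_u, y)
    q = len(powers)
    return ((ssr_r - ssr_u) / q) / (ssr_u / (n - k - q))

random.seed(1)
x = [random.uniform(0, 3) for _ in range(300)]
X = [[1.0, xi] for xi in x]                                  # intercept + one regressor
y_linear = [1 + 2 * xi + random.gauss(0, 1) for xi in x]     # truly linear
y_curved = [1 + xi ** 2 + random.gauss(0, 1) for xi in x]    # curvature omitted by X

f_linear = reset_f(X, y_linear)
f_curved = reset_f(X, y_curved)
print(f"RESET F, linear truth: {f_linear:.2f}")
print(f"RESET F, curved truth: {f_curved:.2f}")
```

When the true relationship is curved, the added powers absorb a large share of the residual variation and the F-statistic is large; when it is linear, they absorb almost nothing.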

Heteroskedasticity: Residual Plots and Breusch-Pagan

The other OLS assumption worth stress-testing is homoskedasticity. I plot baseline residuals against annual income and against number of doctor visits.

Figure 5. Residuals vs. annual income. Spread is larger at low incomes.

Figure 6. Residuals vs. number of doctor visits. Variance widens markedly with more visits.

The residuals-vs-income plot is not clearly fan-shaped, but the variance is noticeably larger for low-income observations, where the bulk of the sample lives. The residuals-vs-doctor-visits plot is more striking: at low visit counts residuals hug zero tightly, but at higher visit counts the positive and negative residuals balloon. That is a visual red flag for heteroskedasticity.

To test formally, I run the Breusch-Pagan regression, regressing squared baseline residuals on the same set of predictors:

ε̂_i² = α₀ + α₁ Income_i + α₂ DocVisits_i + α₃ Age_i + ... + v_i

F = 1.529, p-value = 0.0928.

At the conventional 5% level, this fails to reject the null of homoskedasticity. But the p-value is below 10%, so the test does reject at the 10% level, and a non-rejection at 5% sits uneasily with the visual evidence from the doctor-visits residual plot. My reading is cautious: the formal test does not flag heteroskedasticity at 5%, but the sample is not unambiguously homoskedastic either. In practice, I would report heteroskedasticity-robust (HC1 or HC3) standard errors, or estimate the model using weighted least squares with weights inversely proportional to an estimate of σ_i² as a function of doctor visits. Both remedies are straightforward; robust standard errors protect inference whatever the form of the heteroskedasticity, while weighted least squares relies on the variance model being roughly right.
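Both the test and the robust-standard-error remedy can be sketched for the one-regressor case in plain Python. This is an illustration on synthetic data, not the R code used for the analysis: `x` is a made-up stand-in for doctor visits, and the error spread is built to grow with it. The auxiliary regression of squared residuals on `x` gives the Breusch-Pagan statistic in both LM (n times R²) and F form, and the HC1 formula rescales the heteroskedasticity-consistent variance by n / (n - 2).

```python
import random

def simple_ols(x, y):
    """Closed-form OLS for y = b0 + b1*x; returns (b0, b1, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    return b0, b1, resid

def r_squared(x, y):
    """R-squared from a simple regression of y on x."""
    _, _, resid = simple_ols(x, y)
    my = sum(y) / len(y)
    return 1 - sum(e ** 2 for e in resid) / sum((yi - my) ** 2 for yi in y)

random.seed(2)
n = 1000
# Synthetic regressor; the error standard deviation grows with x by construction.
x = [random.uniform(0, 10) for _ in range(n)]
y = [2 + 3 * xi + xi * random.gauss(0, 1) for xi in x]

b0, b1, resid = simple_ols(x, y)
sq_resid = [e ** 2 for e in resid]

# Breusch-Pagan: regress squared residuals on the regressor.
r2_aux = r_squared(x, sq_resid)
bp_lm = n * r2_aux                               # LM form, chi-square with 1 df here
bp_f = (r2_aux / 1) / ((1 - r2_aux) / (n - 2))   # F form with one regressor

# Classical vs. HC1 standard error of the slope.
mx = sum(x) / n
sxx = sum((xi - mx) ** 2 for xi in x)
se_classical = (sum(sq_resid) / (n - 2) / sxx) ** 0.5
se_hc1 = (n / (n - 2) * sum((xi - mx) ** 2 * e2 for xi, e2 in zip(x, sq_resid))) ** 0.5 / sxx

print(f"BP LM = {bp_lm:.1f} (5% critical value for 1 df: 3.84), F = {bp_f:.1f}")
print(f"slope = {b1:.3f}, classical SE = {se_classical:.4f}, HC1 SE = {se_hc1:.4f}")
```

Comparing the two standard errors shows how much the inference depends on the homoskedasticity assumption; the coefficient estimate itself is the same either way.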

Conclusion

Pulling the threads together:

The simple linear regression shows that annual income alone has a small, negative, statistically significant relationship with medical expenses, but explains only 0.44% of the variation. Income is not the main driver of healthcare spending, and to the extent it carries a negative sign, it is likely absorbing unobserved health status.

The multiple linear regression identifies doctor visits and age as the two dominant statistically significant predictors. Each additional doctor visit is associated with roughly $940 more in total medical expenses; each additional year of age with roughly $94 to $95 more. The interaction between income and doctor visits is not significant and is dropped from the baseline. Marital status, less-than-high-school education, and health insurance status are marginally significant (5%–10%); gender and race are not.

Diagnostic tests flag two caveats. The RESET test jointly rejects linearity in the fitted values (F = 6.399, p < 0.001), suggesting non-linear transformations of income, age, or doctor visits should be explored. The Breusch-Pagan test fails to reject homoskedasticity at 5% (p = 0.0928), but the doctor-visits residual plot is visually suggestive, and robust standard errors would be a prudent addition.

The overall model is jointly significant (F = 27.39, p < 0.001), but the R² of 0.17 is the most honest statistic in the whole analysis. It says that the vast majority of variation in medical spending is driven by things this dataset does not measure: specific chronic conditions, prescription drug regimens, the generosity of a person’s insurance plan, provider networks, geography, and so on. Within the variables we do have, the story is simple and intuitive: how often you see the doctor and how old you are dominate; income is a minor, possibly proxying, player.

Data source: IPUMS Health Surveys. Estimation in R using lm() (OLS). Diagnostics: Ramsey RESET test (powers 2–4 of fitted values) and Breusch-Pagan test on squared residuals.