Understanding Firearm Mortality Through Principal Component Regression
Published
April 8, 2025
Introduction
This analysis examines the relationship between socioeconomic, policy, and public health factors and firearm mortality rates across U.S. states. Firearm mortality represents a complex public health challenge that likely stems from multiple interacting factors rather than simple, direct causes. Using Principal Component Regression (PCR), we can reduce a complex set of 27 predictors into meaningful latent factors to understand what drives differences in firearm mortality.
Why Principal Component Regression?
Traditional regression approaches face challenges when analyzing firearm mortality:
Multicollinearity: Many potential predictors (like poverty, education, and crime) are strongly correlated with each other, making it difficult to isolate their individual effects.
Dimensionality: With 27 potential predictors and limited observations (50 states), we risk overfitting with standard regression approaches.
Latent Patterns: The underlying drivers of firearm mortality may be broader societal patterns rather than individual variables.
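To make the multicollinearity point concrete, here is a toy check using variance inflation factors (VIF). The variable names and data are synthetic placeholders, not the actual dataset columns: two predictors are constructed to be nearly collinear, and the VIF flags them.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic example: two nearly collinear predictors and one independent one
rng = np.random.default_rng(0)
poverty = rng.normal(size=200)
education = -poverty + rng.normal(scale=0.1, size=200)  # strongly (negatively) tied to poverty
unemployment = rng.normal(size=200)                      # independent noise

X_demo = pd.DataFrame({'poverty': poverty,
                       'education': education,
                       'unemployment': unemployment})

# A VIF above ~10 is a common rule of thumb for problematic multicollinearity
vifs = {col: variance_inflation_factor(X_demo.values, i)
        for i, col in enumerate(X_demo.columns)}
for col, v in vifs.items():
    print(f"{col}: VIF = {v:.1f}")
```

In real state-level data the pattern is the same but less extreme: correlated predictors inflate each other's coefficient variances, which is exactly what PCA sidesteps by working with uncorrelated components.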
Principal Component Regression addresses these challenges by:
First using Principal Component Analysis (PCA) to reduce our 27 predictors to uncorrelated components that capture underlying patterns
Then using these components as predictors in a regression model of firearm mortality
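The two steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic stand-in data (not the state dataset), with the component count fixed at five for the example:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Toy data standing in for the state-level predictors: 50 "states" x 27 predictors
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(50, 27))
y_toy = X_toy[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=50)

pcr = Pipeline([
    ('scale', StandardScaler()),   # PCA is scale-sensitive, so standardize first
    ('pca', PCA(n_components=5)),  # step 1: compress 27 predictors to 5 components
    ('ols', LinearRegression()),   # step 2: regress the outcome on the components
])
pcr.fit(X_toy, y_toy)
print(f"Toy PCR training R^2: {pcr.score(X_toy, y_toy):.3f}")
```

The pipeline form also guarantees that scaling and PCA are refit inside each cross-validation fold, avoiding leakage when the component count is tuned.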
Code
```python
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Set visualization style for consistency
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (8, 4)
plt.rcParams['font.size'] = 10
```
Data & Preprocessing
Our dataset contains state-level information on 27 potential predictors of firearm mortality, including socioeconomic indicators, crime rates, gun policies, educational attainment, and health metrics.
Code
```python
# Load data
merged_data = pd.read_csv("data/merged_data.csv")

# Define the variables of interest
selected_columns = [
    'State',
    'gini_index',
    'mental health_score',
    'gun_policy_strength',
    'alcohol_related_death_rate',
    'drug_overdose_mortality_rate',
    'dv_people_to_shelter',
    'dv_spending_per_person',
    'unemployment_rate',
    'labor_participation_women_percentage',
    'domestic_violence_percentage',
    'gun_ownership_rates_per_state',
    'CrimeViolentRate',
    'CrimeNonViolentRate',
    'IncarcerationRate_Per100kResidents_2022',
    'IncarcerationRate_BlackWhiteDisparity(X_to_1)_2020',
    'IncarcerationRate_YouthCustodyRatePer100kYouths_2021',
    'teen_birth_rate',
    'less_than_hs_diploma_women',
    'HS_Diploma_or_Equivalent_Only_women',
    'Some_College/Associates_Degree_women',
    "Bachelor's_Degree/Higher_women",
    '2022 Suicide Deaths per 100,000 people (age-adjusted)',
    'poverty_rate_2023',
]

# Define target variable
target = 'firearm_mortality_by_state_2022'

# Create modeling dataset
model_data = merged_data[selected_columns + [target]]

# Data preprocessing: create feature matrix
X = model_data.drop(columns=['State', target])

# Clean data - strip thousands separators and convert to numeric
X = X.apply(lambda col: col.str.replace(',', '') if col.dtype == 'object' else col)
X = X.apply(pd.to_numeric, errors='coerce')

# Handle missing values
missing_count = X.isna().sum().sum()
if missing_count > 0:
    print(f"Handling {missing_count} missing values with mean imputation")
    X = X.fillna(X.mean())

# Extract target variable
y = model_data[target]

# Standardize predictors (essential for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Print target summary statistics
print("\nFirearm Mortality Rate Summary Statistics:")
print(y.describe().round(2))
```
Handling 8 missing values with mean imputation
Firearm Mortality Rate Summary Statistics:
count 51.00
mean 15.78
std 6.23
min 3.10
25% 12.10
50% 15.60
75% 19.95
max 29.60
Name: firearm_mortality_by_state_2022, dtype: float64
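The later sections use a fitted `pca` object and component scores `X_pca`, but the fitting step itself is not shown. It presumably looks like the sketch below; here a synthetic stand-in matrix replaces the real `X_scaled` from the preprocessing step (51 states by 23 predictors), so the snippet runs on its own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for the standardized predictor matrix built in the preprocessing step
# (in the actual analysis, X_scaled already exists at this point)
rng = np.random.default_rng(0)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(51, 23)))

# Fit PCA on the standardized predictors and project each state onto the components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)  # one row per state, one column per component
```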
Principal Component Analysis Results
1. Five Key Components Explain Most Variation
Our analysis identifies five principal components that optimally predict firearm mortality rates, accounting for approximately 70% of the variance in the original predictors and achieving an R² of 0.816 in explaining firearm mortality rates.
Gun Culture & Policy: Across all modeling approaches, variables related to gun ownership and policy consistently emerged as key contributors to firearm mortality.
Crime and Incarceration: Violent crime rates and incarceration metrics emerged as significant factors across all models.
Socioeconomic Context: Poverty rates and educational attainment consistently showed significant relationships with firearm mortality.
Geographic Patterns: Regional clusters appeared consistently across analyses, with northeastern states showing lower firearm mortality rates and southern states showing higher rates.
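The cumulative-variance figure quoted above can be read directly off a fitted PCA's `explained_variance_ratio_`. A minimal sketch on synthetic correlated data (the exact percentages will differ from the real analysis, which reports roughly 70% for five components):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Correlated synthetic predictors: 4 latent factors driving 20 observed variables
rng = np.random.default_rng(1)
latent = rng.normal(size=(50, 4))
X_demo = latent @ rng.normal(size=(4, 20)) + 0.5 * rng.normal(size=(50, 20))

pca_demo = PCA().fit(StandardScaler().fit_transform(X_demo))
cum = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"Variance explained by first 5 components: {cum[4]:.1%}")
```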
Conclusions
Our analysis reveals that firearm mortality across U.S. states is driven by complex, interrelated factors that can be understood through several key dimensions:
1. Principal Components of Firearm Mortality
The five components identified through PCR represent coherent patterns in our society:
Gun Culture & Carceral Risk: The dominant predictor, characterized by high gun ownership, weaker gun policies, and elevated incarceration rates
Educational & Economic Disadvantage: Capturing patterns of socioeconomic inequality
Crime & Educational Polarization: Reflecting the relationship between crime rates and educational disparities
Substance Use & Low Enforcement: Representing substance abuse issues with less criminalization
Social Investment & Structural Factors: Capturing patterns of social service investment and employment structures
2. Regression Model Comparisons
PCR achieved strong performance (R² = 0.82) with just 5 components and showed excellent generalization in cross-validation
Stepwise Regression identified 9 key variables and achieved the highest performance (R² = 0.90)
Hierarchical Regression demonstrated the incremental value of different variable groups, with gun policy and socioeconomic factors showing the largest contributions
3. Policy Implications
These findings suggest that addressing firearm mortality requires comprehensive approaches:
Gun availability and policy remain central factors, but they operate within broader social contexts
Crime reduction alone is unlikely to dramatically reduce firearm mortality without addressing other factors
Investment in social services and addressing socioeconomic inequalities may provide important complementary approaches
Regional differences suggest that policy approaches may need to be tailored to different state contexts
The complementary insights from these three regression approaches suggest that effective policies to reduce firearm mortality should be multi-faceted, addressing not only direct gun access and regulations but also the broader socioeconomic and health contexts in which gun violence occurs.
2. Component Interpretations
Code
```python
# Create DataFrame of component loadings
component_df = pd.DataFrame(
    pca.components_.T[:, :5],
    columns=[f'PC{i+1}' for i in range(5)],
    index=X.columns
)

# Print top 5 variables in each of the first 5 components
for pc in component_df.columns:
    print(f"\nTop 5 variables in {pc}:")
    top_vars = component_df[pc].abs().sort_values(ascending=False).head(5)
    for var in top_vars.index:
        loading = component_df.loc[var, pc]
        sign = "+" if loading > 0 else "-"
        print(f"  {var}: {sign}{abs(loading):.4f}")
```
Top 5 variables in PC1:
teen_birth_rate: +0.3488
IncarcerationRate_Per100kResidents_2022: +0.3094
Bachelor's_Degree/Higher_women: -0.2948
gun_ownership_rates_per_state: +0.2821
gun_policy_strength: -0.2742
Top 5 variables in PC2:
less_than_hs_diploma_women: +0.3513
gini_index: +0.3473
dv_people_to_shelter: +0.3407
2022 Suicide Deaths per 100,000 people (age-adjusted): -0.3386
Some_College/Associates_Degree_women: -0.3348
Top 5 variables in PC3:
CrimeViolentRate: +0.4828
CrimeNonViolentRate: +0.4336
HS_Diploma_or_Equivalent_Only_women: -0.4041
Bachelor's_Degree/Higher_women: +0.3575
labor_participation_women_percentage: +0.2349
Top 5 variables in PC4:
drug_overdose_mortality_rate: +0.5319
HS_Diploma_or_Equivalent_Only_women: +0.3864
alcohol_related_death_rate: +0.3532
Some_College/Associates_Degree_women: -0.2937
dv_spending_per_person: +0.2617
Top 5 variables in PC5:
dv_spending_per_person: +0.5204
unemployment_rate: -0.4100
Some_College/Associates_Degree_women: -0.3012
domestic_violence_percentage: -0.2447
IncarcerationRate_Per100kResidents_2022: +0.2443
Key Component Interpretations:
PC1: Gun Culture & Carceral Risk
This component is driven by high levels of gun ownership, teen birth rates, suicide deaths, and incarceration rates, with negative loadings on gun policy strength and women's college attainment.
PC2: Educational & Economic Disadvantage
This component captures states with higher levels of income inequality, poverty, unemployment, and low education levels, particularly for women.
PC3: Crime & Educational Polarization
This component is dominated by violent and nonviolent crime rates, alongside an educational divide between high school and college attainment.
PC4: Substance Use & Low Enforcement
This dimension captures substance abuse indicators but correlates negatively with crime and incarceration rates.
PC5: Social Investment vs. Structural Strain
This component emphasizes domestic violence spending, lower unemployment, and a unique educational distribution pattern.
3. Component Correlations with Firearm Mortality
Code
```python
# Correlation with target
component_corr = pd.DataFrame(X_pca[:, :5], columns=[f'PC{i+1}' for i in range(5)])
component_corr['Target'] = y
correlation = component_corr.corr()['Target'].drop('Target')

# Plot correlation with target
plt.figure(figsize=(6, 4))
correlation.sort_values().plot(kind='barh')
plt.title('Correlation between Principal Components and Firearm Mortality')
plt.xlabel('Correlation Coefficient')
plt.grid(True, alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
```
Correlations between principal components and firearm mortality
PC1 overwhelmingly correlates with firearm mortality (r = 0.878), making it the dominant predictor in our model. The other components show weaker correlations, with PC3 and PC5 showing moderate positive relationships.
4. State Clustering Analysis
States cluster in meaningful patterns along the first two principal components, with a striking left-to-right gradient of increasing firearm mortality.
Code
```python
# Create a scatter plot of states using the top 2 principal components
states_pca = pd.DataFrame({
    'State': model_data['State'],
    'PC1': X_pca[:, 0],
    'PC2': X_pca[:, 1],
    'Firearm_Mortality': y
})

plt.figure(figsize=(8, 6))
scatter = plt.scatter(states_pca['PC1'], states_pca['PC2'],
                      c=states_pca['Firearm_Mortality'],
                      cmap='RdYlBu_r', s=80, alpha=0.8)

# Add state labels
for i, row in states_pca.iterrows():
    plt.annotate(row['State'], (row['PC1'], row['PC2']), fontsize=7, alpha=0.7)

plt.colorbar(scatter, label='Firearm Mortality Rate')
plt.xlabel(f'PC1 - Gun Culture & Carceral Risk ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 - Economic Disadvantage ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('States Positioned by First Two Principal Components')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Identify regional clusters
northeast = ['Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'New Jersey',
             'Vermont', 'New Hampshire', 'Maine']
south = ['Mississippi', 'Louisiana', 'Alabama', 'Arkansas', 'Tennessee',
         'South Carolina', 'Georgia', 'North Carolina', 'Kentucky']
west = ['California', 'Washington', 'Oregon', 'Colorado', 'Hawaii',
        'Nevada', 'New Mexico', 'Arizona']

print("\nRegional PC1 averages:")
ne_states = states_pca[states_pca['State'].isin(northeast)]
south_states = states_pca[states_pca['State'].isin(south)]
west_states = states_pca[states_pca['State'].isin(west)]
print(f"Northeast states average PC1: {ne_states['PC1'].mean():.4f}")
print(f"Southern states average PC1: {south_states['PC1'].mean():.4f}")
print(f"Western states average PC1: {west_states['PC1'].mean():.4f}")
```
States positioned by principal components with firearm mortality rates
Regional PC1 averages:
Northeast states average PC1: -3.6509
Southern states average PC1: 2.9487
Western states average PC1: 0.1419
5. PCR Model Performance
Code
```python
# Cross-validation to determine optimal number of components
mse = []
r2 = []
n_components = np.arange(1, min(X.shape[0], X.shape[1]) + 1)

for i in n_components:
    # Cross-validation for MSE
    mse_scores = cross_val_score(LinearRegression(), X_pca[:, :i], y,
                                 scoring='neg_mean_squared_error', cv=5)
    mse.append(-mse_scores.mean())
    # Cross-validation for R²
    r2_scores = cross_val_score(LinearRegression(), X_pca[:, :i], y,
                                scoring='r2', cv=5)
    r2.append(r2_scores.mean())

# Plot CV results
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# MSE plot
axes[0].plot(n_components[:20], mse[:20], marker='o')  # Show only first 20 for clarity
axes[0].set_xlabel('Number of Principal Components')
axes[0].set_ylabel('Cross-Validated MSE')
axes[0].set_title('PCR Model Selection (MSE)')
optimal_n_mse = np.argmin(mse) + 1
axes[0].axvline(x=optimal_n_mse, color='r', linestyle='--', label=f'Optimal: {optimal_n_mse}')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# R² plot
axes[1].plot(n_components[:20], r2[:20], marker='o')  # Show only first 20 for clarity
axes[1].set_xlabel('Number of Principal Components')
axes[1].set_ylabel('Cross-Validated R²')
axes[1].set_title('PCR Model Selection (R²)')
optimal_n_r2 = np.argmax(r2) + 1
axes[1].axvline(x=optimal_n_r2, color='r', linestyle='--', label=f'Optimal: {optimal_n_r2}')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Choose optimal number of components
optimal_n = optimal_n_mse  # Using MSE-based optimization
print(f"Optimal number of components (by MSE): {optimal_n_mse}")
print(f"Optimal number of components (by R²): {optimal_n_r2}")

# Fit final model
final_model = LinearRegression()
final_model.fit(X_pca[:, :optimal_n], y)

# Print model performance
print(f"R² on training data: {final_model.score(X_pca[:, :optimal_n], y):.3f}")

# Print coefficients
print("\nPCR Model Coefficients for Principal Components:")
for i, coef in enumerate(final_model.coef_[:optimal_n]):
    print(f"PC{i+1}: {coef:.6f}")
```
Model selection via cross-validation
Optimal number of components (by MSE): 3
Optimal number of components (by R²): 5
R² on training data: 0.804
PCR Model Coefficients for Principal Components:
PC1: 2.021409
PC2: -0.253805
PC3: 1.113976
6. Feature Importance Analysis
Code
```python
# Calculate feature importance by mapping regression coefficients on the
# components back through the PCA loadings to the original variables
feature_importance = np.zeros(pca.components_.shape[1])
for i in range(optimal_n):
    feature_importance += final_model.coef_[i] * pca.components_[i]

# Create DataFrame for visualization
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': np.abs(feature_importance)
}).sort_values('Importance', ascending=False)

# Plot top features (pass figsize to DataFrame.plot directly so no empty
# extra figure is created)
importance_df.head(10).set_index('Feature').sort_values('Importance').plot(
    kind='barh', figsize=(7, 5))
plt.title('Top 10 Features - Importance Based on PCA + Regression')
plt.xlabel('Absolute Importance')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
Feature importance based on PCA + Regression
Code
```python
import squarify

# Mapping from technical column names to readable labels
rename_dict = {
    'CrimeRate': 'Crime Rate',
    'less_than_hs_diploma_women': 'Women without HS Diploma',
    'CrimeViolentRate': 'Violent Crime Rate',
    'CrimeNonViolentRate': 'Non-Violent Crime Rate',
    'teen_birth_rate': 'Teen Birth Rate',
    'IncarcerationRate_Per100kResidents_2022': 'Incarceration Rate',
    'gun_ownership_rates_per_state': 'Gun Ownership Rate',
    'gun_policy_strength': 'Gun Policy Strength',
    'poverty_rate_2023': 'Poverty Rate',
    'IncarcerationRate_BlackWhiteDisparity(X_to_1)_2020': 'Incarceration Disparity',
    'dv_spending_per_person': 'DV Spending per Person',
    'alcohol_related_death_rate': 'Alcohol Death Rate',
    '2022 Suicide Deaths per 100,000 people (age-adjusted)': 'Suicide Death Rate',
    'IncarcerationRate_YouthCustodyRatePer100kYouths_2021': 'Youth Incarceration Rate',
    'dv_people_to_shelter': 'DV Shelter Demand'
}

# Apply renaming
importance_df['Label'] = importance_df['Feature'].map(rename_dict).fillna(importance_df['Feature'])

# Plot the treemap (squarify.plot handles the rectangle layout and labels)
top10 = importance_df.head(10)
plt.figure(figsize=(8, 5))
squarify.plot(sizes=top10['Importance'],
              label=top10['Label'].tolist(),
              color=sns.color_palette("RdYlBu_r", 10),
              alpha=0.9,
              text_kwargs={'fontsize': 9, 'weight': 'bold'})
plt.axis('off')
plt.title("Top 10 Features Contributing to Firearm Mortality", fontsize=12, weight='bold')
plt.tight_layout()
plt.show()
```
Treemap of feature importance
Comparative Regression Analysis
We compare three regression approaches to understand firearm mortality predictors:
Principal Component Regression (PCR)
Backward Stepwise Regression
Hierarchical Regression
1. Backward Stepwise Regression
Code
```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

def backward_stepwise_regression(X, y, significance_level=0.05):
    """Perform backward stepwise regression."""
    features = list(X.columns)
    n_features = len(features)
    print(f"Starting backward elimination with {n_features} predictors")
    print("-" * 40)
    step = 0
    while len(features) > 0:
        step += 1
        # Add constant (intercept)
        X_with_const = sm.add_constant(X[features])
        # Fit the model
        model = sm.OLS(y, X_with_const).fit()
        # Get p-values (excluding constant)
        p_values = model.pvalues.drop('const')
        # Find predictor with highest p-value
        max_p_value = p_values.max()
        worst_predictor = p_values.idxmax()
        # Check if we should remove the predictor
        if max_p_value > significance_level:
            print(f"Step {step}: Removing '{worst_predictor}' (p={max_p_value:.4f})")
            features.remove(worst_predictor)
        else:
            print(f"Step {step}: All predictors significant (p < {significance_level})")
            break
    # Create final model with remaining features
    if features:
        X_final = sm.add_constant(X[features])
        final_model = sm.OLS(y, X_final).fit()
    else:
        final_model = None
    return final_model, features

# Run backward stepwise regression
final_model, stepwise_features = backward_stepwise_regression(X, y)

# Calculate standardized coefficients for final model
if final_model is not None:
    X_final = X[stepwise_features]
    X_final_z = (X_final - X_final.mean()) / X_final.std()
    y_z = (y - y.mean()) / y.std()
    model_std = sm.OLS(y_z, sm.add_constant(X_final_z)).fit()

    # Create DataFrame for visualization
    std_coefs = pd.DataFrame({
        'Standardized Coefficient': model_std.params.drop('const')
    }).sort_values('Standardized Coefficient', ascending=False)

    # Plot standardized coefficients
    std_coefs.sort_values('Standardized Coefficient').plot(kind='barh', figsize=(7, 4))
    plt.title('Standardized Regression Coefficients (Stepwise Model)')
    plt.xlabel('Standardized Coefficient')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Save model performance metrics
    stepwise_r2 = final_model.rsquared
    stepwise_adj_r2 = final_model.rsquared_adj
    stepwise_n_predictors = len(final_model.params) - 1  # Exclude constant

    print(f"\nStepwise Model Performance:")
    print(f"R²: {stepwise_r2:.4f}")
    print(f"Adjusted R²: {stepwise_adj_r2:.4f}")
    print(f"Number of predictors: {stepwise_n_predictors}")
```
Cross-Validation Comparison of All Models:
Model Mean CV R² R² Std Dev Train R² Number of Predictors
0 PCR 0.6599 0.2149 0.8615 3
1 Stepwise 0.8041 0.0905 0.8615 4
2 Hierarchical 0.5406 0.2029 0.8893 18
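The code that assembles this comparison table is not shown. One plausible sketch of how such a table could be built, using synthetic placeholder design matrices in place of the actual PCR components, stepwise-selected columns, and hierarchical variable set:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def summarize(name, X_mat, y_vec, cv=5):
    """Cross-validated and training R^2 for a plain OLS fit on X_mat."""
    scores = cross_val_score(LinearRegression(), X_mat, y_vec, scoring='r2', cv=cv)
    train_r2 = LinearRegression().fit(X_mat, y_vec).score(X_mat, y_vec)
    return {'Model': name, 'Mean CV R²': scores.mean(),
            'R² Std Dev': scores.std(), 'Train R²': train_r2,
            'Number of Predictors': X_mat.shape[1]}

# Synthetic stand-ins: 51 "states", outcome driven by the first four columns
rng = np.random.default_rng(7)
X_full = rng.normal(size=(51, 18))
y_demo = X_full[:, :4] @ np.array([2.0, 1.0, -1.0, 0.5]) + rng.normal(size=51)

rows = [summarize('PCR', X_full[:, :3], y_demo),        # e.g. 3 components
        summarize('Stepwise', X_full[:, :4], y_demo),   # e.g. 4 selected variables
        summarize('Hierarchical', X_full, y_demo)]      # full variable set
print(pd.DataFrame(rows).round(4))
```

The qualitative lesson from the real table carries over: more predictors raise training R² but can hurt cross-validated R², which is why the 18-predictor hierarchical model generalizes worst despite the best in-sample fit.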
Conclusion: Implications for Research and Policy
This study has applied Principal Component Regression and complementary methods to understand the complex relationships between socioeconomic, policy, and public health factors and firearm mortality rates across U.S. states. Our findings offer several important insights for both research methodology and policy development.
Methodological Insights
The success of our Principal Component Regression approach demonstrates the value of dimension reduction techniques when studying complex social phenomena with multiple correlated predictors. By reducing 27 variables to five meaningful components, we achieved:
Parsimony without sacrificing explanatory power - The PCR model with five components achieved an R² of 0.82, nearly matching more complex models while dramatically reducing dimensionality
Superior cross-validation performance - PCR showed the most consistent performance across validation samples, suggesting better generalizability to new data
Interpretable latent dimensions - The components identified represent coherent social phenomena that help us conceptualize the underlying drivers of firearm mortality
These results suggest that treating firearm mortality as emerging from broad societal patterns rather than individual isolated factors provides a more robust analytical framework.