Propensity score matching

In the statistical analysis of observational data, propensity score matching (PSM) is a statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment. PSM attempts to reduce the bias due to confounding variables that could be found in an estimate of the treatment effect obtained from simply comparing outcomes among units that received the treatment versus those that did not. Paul R. Rosenbaum and Donald Rubin introduced the technique in 1983.^[1]

The possibility of bias arises because a difference in the treatment outcome (such as the average treatment effect) between treated and untreated groups may be caused by a factor that predicts treatment rather than the treatment itself. In randomized experiments, the randomization enables unbiased estimation of treatment effects; for each covariate, randomization implies that treatment-groups will be balanced on average, by the law of large numbers. Unfortunately, for observational studies, the assignment of treatments to research subjects is typically not random. Matching attempts to reduce the treatment assignment bias, and mimic randomization, by creating a sample of units that received the treatment that is comparable on all observed covariates to a sample of units that did not receive the treatment.

The "propensity" describes how likely a unit is to have been treated, given its covariate values. The better the covariates predict whether a unit is treated, the stronger the confounding of treatment and covariates, and hence the stronger the bias in the naive estimate of the treatment effect. Matching treated and control units with similar propensity scores reduces this confounding.

For example, one may be interested in the consequences of smoking. An observational study is required because it is unethical to randomly assign people to the treatment "smoking". The treatment effect estimated by simply comparing those who smoked to those who did not would be biased by any factors that predict smoking (e.g., gender and age). PSM attempts to control for these biases by making the treated and untreated groups comparable with respect to the control variables.
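The bias in the naive comparison can be illustrated with a small sketch. The data below are entirely hypothetical (invented counts and health scores, with age group as the only confounder and a true effect of -2 in each group); the function names are illustrative and not taken from any library.

```python
# Hypothetical toy data: (age_group, smoker, health_score).
# By construction, smoking lowers the score by 2 within every age group,
# but older people both smoke more and have lower baseline scores.
records = (
    [("young", 0, 10.0)] * 8 + [("young", 1, 8.0)] * 2
    + [("old", 0, 6.0)] * 2 + [("old", 1, 4.0)] * 8
)


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Difference in mean outcome, smokers minus non-smokers, ignoring age."""
    treated = [y for _, z, y in rows if z == 1]
    control = [y for _, z, y in rows if z == 0]
    return mean(treated) - mean(control)


def stratified_difference(rows):
    """Average of within-age-group differences, weighted by group size."""
    groups = sorted({g for g, _, _ in rows})
    estimate = 0.0
    for g in groups:
        subset = [r for r in rows if r[0] == g]
        estimate += len(subset) / len(rows) * naive_difference(subset)
    return estimate


print("naive:", naive_difference(records))            # about -4.4 (biased)
print("stratified:", stratified_difference(records))  # -2.0 (the built-in effect)
```

In this toy case comparing like with like on a single covariate removes the bias; propensity score methods aim at the same goal when there are too many covariates to stratify on directly.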

Overview

PSM is for cases of causal inference and confounding bias in non-experimental settings in which: (i) few units in the non-treatment comparison group are comparable to the treatment units; and (ii) selecting a subset of comparison units similar to the treatment unit is difficult because units must be compared across a high-dimensional set of pretreatment characteristics.^{[citation needed]}

In normal matching, single characteristics that distinguish treatment and control groups are matched in an attempt to make the groups more alike. But if the two groups do not have substantial overlap, then substantial error may be introduced. For example, if only the worst cases from the untreated "comparison" group are compared to only the best cases from the treatment group, the result may be regression toward the mean, which may make the comparison group look better or worse than reality.^{[citation needed]}

PSM employs a predicted probability of group membership (e.g., treatment versus control group) based on observed predictors, usually obtained from logistic regression, to create a counterfactual group. Propensity scores may be used for matching or as covariates, alone or with other matching variables or covariates.

General procedure

1. Estimate propensity scores, e.g. with logistic regression:

Dependent variable: Z = 1, if unit participated (i.e. is member of the treatment group); Z = 0, if unit did not participate (i.e. is member of the control group).
Choose appropriate confounders (variables hypothesized to be associated with both treatment and outcome)
Obtain an estimation for the propensity score: predicted probability p or the log odds, log[p/(1 − p)].

2. Match each participant to one or more nonparticipants on propensity score, using one of these methods:

Nearest neighbor matching
Optimal full matching: match each participant to unique non-participant(s) so as to minimize the total distance in propensity scores between participants and their matched non-participants. This method can be combined with other matching techniques.
Caliper matching: comparison units within a certain width of the propensity score of the treated units get matched, where the width is generally a fraction of the standard deviation of the propensity score.
Radius matching: all matches within a particular radius are used, and reused between treatment units.
Kernel matching: same as radius matching, except control observations are weighted as a function of the distance between the treatment observation's propensity score and the control match's propensity score. One example is the Epanechnikov kernel. Radius matching is a special case where a uniform kernel is used.
Mahalanobis metric matching in conjunction with PSM
Stratification matching
Difference-in-differences matching (kernel and local linear weights)
Exact matching

3. Check that covariates are balanced across treatment and comparison groups within strata of the propensity score.

Use standardized differences or graphs to examine distributions
If covariates are not balanced, return to steps 1 or 2 and modify the procedure

4. Estimate effects based on new sample

Typically: a weighted mean of within-match average differences in outcomes between participants and non-participants.
Use analyses appropriate for non-independent matched samples if more than one nonparticipant is matched to each participant
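Steps 2 to 4 above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the propensity scores are taken as already estimated, the eight units and their values are hypothetical, matching is greedy 1:1 nearest-neighbor without replacement, and the caliper is given directly on the propensity-score scale rather than as a fraction of its standard deviation.

```python
import math
import statistics

# Hypothetical units: id -> (treated flag Z, propensity score, age, outcome).
UNITS = {
    "t1": (1, 0.80, 60, 12.0),
    "t2": (1, 0.60, 50, 10.0),
    "t3": (1, 0.40, 40, 9.0),
    "c1": (0, 0.78, 59, 10.0),
    "c2": (0, 0.58, 51, 8.0),
    "c3": (0, 0.41, 39, 7.0),
    "c4": (0, 0.10, 20, 3.0),
    "c5": (0, 0.15, 25, 4.0),
}


def match_nearest(units, caliper):
    """Step 2: greedy 1:1 nearest-neighbor matching within a caliper."""
    free = {u: v[1] for u, v in units.items() if v[0] == 0}
    pairs = []
    for t, v in units.items():
        if v[0] != 1 or not free:
            continue
        # Closest still-unmatched control on the propensity score.
        c = min(free, key=lambda k: abs(free[k] - v[1]))
        if abs(free[c] - v[1]) <= caliper:
            pairs.append((t, c))
            del free[c]  # without replacement
    return pairs


def standardized_difference(a, b):
    """Step 3: difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def effect_from_pairs(units, pairs):
    """Step 4: mean within-pair outcome difference (treated minus control)."""
    return statistics.mean([units[t][3] - units[c][3] for t, c in pairs])


pairs = match_nearest(UNITS, caliper=0.05)
treated_age = [v[2] for v in UNITS.values() if v[0] == 1]
all_control_age = [v[2] for v in UNITS.values() if v[0] == 0]
matched_control_age = [UNITS[c][2] for _, c in pairs]

smd_before = standardized_difference(treated_age, all_control_age)
smd_after = standardized_difference(treated_age, matched_control_age)
effect = effect_from_pairs(UNITS, pairs)
print(pairs, round(smd_before, 3), round(smd_after, 3), effect)
```

Greedy matching depends on the order in which treated units are processed; optimal matching avoids this at a higher computational cost. Because every treated unit finds a match here, the estimate is an average effect among the treated units.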

Formal definitions

Basic settings

The basic case^[1] is of two treatments (numbered 1 and 0), with N independent and identically distributed subjects. Each subject i would respond to the treatment with $r_{1i}$ and to the control with $r_{0i}$. The quantity to be estimated is the average treatment effect: $E[r_{1}]-E[r_{0}]$. The variable $Z_{i}$ indicates whether subject i received treatment ($Z_{i}=1$) or control ($Z_{i}=0$). Let $X_{i}$ be a vector of observed pretreatment measurements (or covariates) for the ith subject. The observations of $X_{i}$ are made prior to treatment assignment, but the features in $X_{i}$ may not include all (or any) of the ones used to decide on the treatment assignment. The numbering of the units (i.e., i = 1, ..., N) is assumed not to contain any information beyond what is contained in $X_{i}$. The following sections omit the index i while still discussing the stochastic behavior of some subject.

Strongly ignorable treatment assignment

Let some subject have a vector of covariates X and potential outcomes $r_{0}$ and $r_{1}$ under control and treatment, respectively. Treatment assignment is said to be strongly ignorable (i.e., conditionally unconfounded) if the potential outcomes are independent of treatment (Z) conditional on background variables X. This can be written compactly as

$$(r_{0},r_{1})\perp Z\mid X$$

where $\perp$ denotes statistical independence.^[1]

Balancing score

A balancing score b(X) is a function of the observed covariates X such that the conditional distribution of X given b(X) is the same for treated (Z = 1) and control (Z = 0) units:

$$Z\perp X\mid b(X).$$

The most trivial function is $b(X)=X$ .

Propensity score

A propensity score is the probability of a unit (e.g., person, classroom, school) being assigned to a particular treatment given a set of observed covariates. Propensity scores are used to reduce confounding by equating groups based on these covariates.

Suppose that we have a binary treatment indicator Z, a response variable r, and background observed covariates X. The propensity score is defined as the conditional probability of treatment given background variables:

$$e(x)\ {\stackrel {\mathrm {def} }{=}}\ \Pr(Z=1\mid X=x).$$
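The definition, and the balancing property $Z\perp X\mid e(X)$, can be checked exactly on a small discrete example. The sketch below assumes a hypothetical covariate with four values and made-up probabilities; `P_X`, `PROPENSITY` and the function name are illustrative only.

```python
from fractions import Fraction as F

# Hypothetical population: distribution of a discrete covariate X and
# the propensity score e(x) = Pr(Z = 1 | X = x) for each value.
P_X = {"a": F(1, 10), "b": F(3, 10), "c": F(2, 10), "d": F(4, 10)}
PROPENSITY = {"a": F(1, 5), "b": F(1, 5), "c": F(3, 5), "d": F(3, 5)}


def distribution_of_x_given_z(z, score=None):
    """Exact P(X = x | Z = z), optionally also conditioning on e(X) = score."""
    joint = {}
    for x, px in P_X.items():
        if score is not None and PROPENSITY[x] != score:
            continue
        pz = PROPENSITY[x] if z == 1 else 1 - PROPENSITY[x]
        joint[x] = px * pz
    total = sum(joint.values())
    return {x: mass / total for x, mass in joint.items()}


# Without conditioning, X is distributed differently among treated and controls.
print(distribution_of_x_given_z(1))
print(distribution_of_x_given_z(0))

# Within each level of the propensity score, the two distributions coincide.
for score in sorted(set(PROPENSITY.values())):
    treated = distribution_of_x_given_z(1, score)
    control = distribution_of_x_given_z(0, score)
    print(score, treated == control)
```

Exact fractions are used so that the equality of the two conditional distributions is not obscured by floating-point rounding.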

In causal inference and survey methodology, propensity scores are estimated from a set of covariates with methods such as logistic regression or random forests. The estimated scores can then be used to construct weights for inverse probability weighting methods.
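As a rough sketch of this workflow, the code below simulates data with a single confounder, estimates the propensity score by logistic regression, and forms a normalized inverse-probability-weighted estimate of the average treatment effect. Everything in it is an assumption made for illustration: the data-generating process (true effect 1.0), the hand-written Newton fit for one covariate, and the choice of the normalized form of the weighting estimator.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(xs, zs, iterations=25):
    """Fit Pr(Z=1 | x) = sigmoid(b0 + b1*x) by Newton's method (one covariate)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += z - p          # gradient of the log-likelihood
            g1 += (z - p) * x
            h00 += w             # entries of the (negated) Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def ipw_ate(ys, zs, scores):
    """Normalized inverse-probability-weighted difference in mean outcomes."""
    w1 = [z / e for z, e in zip(zs, scores)]
    w0 = [(1 - z) / (1 - e) for z, e in zip(zs, scores)]
    mean1 = sum(w * y for w, y in zip(w1, ys)) / sum(w1)
    mean0 = sum(w * y for w, y in zip(w0, ys)) / sum(w0)
    return mean1 - mean0


# Simulated data (assumption): x confounds treatment and outcome.
random.seed(0)
n = 2000
xs = [random.uniform(-2, 2) for _ in range(n)]
zs = [1 if random.random() < sigmoid(x) else 0 for x in xs]
ys = [1.0 * z + 2.0 * x + random.gauss(0, 1) for x, z in zip(xs, zs)]

b0, b1 = fit_logistic(xs, zs)
scores = [sigmoid(b0 + b1 * x) for x in xs]

treated_y = [y for y, z in zip(ys, zs) if z == 1]
control_y = [y for y, z in zip(ys, zs) if z == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
ipw = ipw_ate(ys, zs, scores)
print("naive:", naive, "ipw:", ipw)
```

In this simulation the naive difference is pushed well above the true effect because treated units tend to have larger x, while the weighted estimate should land much closer to 1.0. Weights can become unstable when estimated scores approach 0 or 1, which is one reason overlap between groups matters.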

Main theorems

The following were first presented, and proven, by Rosenbaum and Rubin in 1983:^[1]

The propensity score $e(x)$ is a balancing score.
Any score that is 'finer' than the propensity score is a balancing score (i.e., $e(X)=f(b(X))$ for some function f). The propensity score is the coarsest balancing score, as it maps a (possibly) multidimensional object ($X_{i}$) to a single dimension (other balancing scores also exist), while $b(X)=X$ is the finest one.
If treatment assignment is strongly ignorable given X then:

It is also strongly ignorable given any balancing function. Specifically, given the propensity score:

$$(r_{0},r_{1})\perp Z\mid e(X).$$

For any value of a balancing score, the difference between the treatment and control sample means (i.e., ${\bar {r}}_{1}-{\bar {r}}_{0}$), based on subjects that have the same value of the balancing score, is an unbiased estimator of the average treatment effect at that value of the balancing score; averaging these differences over the distribution of the balancing score estimates the average treatment effect $E[r_{1}]-E[r_{0}]$.

Using sample estimates of balancing scores can produce sample balance on X
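The step from strong ignorability given $e(X)$ to an estimable quantity can be written out as a short derivation. This is a sketch rather than a restatement of the original proofs; it additionally assumes overlap, $0<e(X)<1$, so that both conditional expectations below are defined, and it writes the observed response as $r=Zr_{1}+(1-Z)r_{0}$.

$$
\begin{aligned}
E[r_{1}]-E[r_{0}] &= E_{e(X)}\big[\,E[r_{1}\mid e(X)]-E[r_{0}\mid e(X)]\,\big] \\
&= E_{e(X)}\big[\,E[r_{1}\mid Z=1,e(X)]-E[r_{0}\mid Z=0,e(X)]\,\big] \\
&= E_{e(X)}\big[\,E[r\mid Z=1,e(X)]-E[r\mid Z=0,e(X)]\,\big].
\end{aligned}
$$

The first line is the law of iterated expectations, the second uses $(r_{0},r_{1})\perp Z\mid e(X)$, and the third replaces each potential outcome by the observed response in the group where it is observed. The inner difference is what a comparison of treated and control units with the same propensity score estimates.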

Relationship to sufficiency

If we think of the value of Z as a parameter of the population that affects the distribution of X, then the balancing score serves as a sufficient statistic for Z. Furthermore, the above theorems indicate that the propensity score is a minimal sufficient statistic when Z is regarded as a parameter of X. Lastly, if treatment assignment Z is strongly ignorable given X, then the propensity score is a minimal sufficient statistic for the joint distribution of $(r_{0},r_{1})$.

Graphical test for detecting the presence of confounding variables

Judea Pearl has shown that there exists a simple graphical test, called the back-door criterion, which detects the presence of confounding variables. To estimate the effect of treatment, the background variables X must block all back-door paths in the graph. This blocking can be done either by adding the confounding variable as a control in regression, or by matching on the confounding variable.^[2]

Disadvantages

King and Nielsen argue that PSM can increase "imbalance, inefficiency, model dependence, and bias," which is not the case with most other matching methods.^[3] In their view, the insights behind the use of matching still hold but should be applied with other matching methods; propensity scores also have other productive uses in weighting and doubly robust estimation.

Like other matching procedures, PSM estimates an average treatment effect from observational data. At the time of its introduction, the key advantage of PSM was that, by using a linear combination of covariates for a single score, it balances treatment and control groups on a large number of covariates without losing a large number of observations. If units in the treatment and control groups were instead balanced on a large number of covariates one at a time, large numbers of observations would be needed to overcome the "dimensionality problem", whereby each new balancing covariate increases the minimum necessary number of observations in the sample geometrically.

One disadvantage of PSM is that it only accounts for observed (and observable) covariates and not latent characteristics. Factors that affect assignment to treatment and outcome but that cannot be observed cannot be accounted for in the matching procedure.^[4] As the procedure only controls for observed variables, any hidden bias due to latent variables may remain after matching.^[5] Another issue is that PSM requires large samples, with substantial overlap between treatment and control groups.

General concerns with matching have also been raised by Judea Pearl, who has argued that hidden bias may actually increase because matching on observed variables may unleash bias due to dormant unobserved confounders. Similarly, Pearl has argued that bias reduction can only be assured (asymptotically) by modelling the qualitative causal relationships between treatment, outcome, observed and unobserved covariates.^[6] Confounding occurs when the experimenter is unable to control for alternative, non-causal explanations for an observed relationship between independent and dependent variables. Such control should satisfy the "backdoor criterion" of Pearl.^[2]

Implementations in statistics packages

R: propensity score matching is available as part of the MatchIt,^[7]^[8] optmatch,^[9] or other packages.
SAS: the PSMatch procedure and the OneToManyMTCH macro match observations based on a propensity score.^[10]
Stata: several commands implement propensity score matching,^[11] including the user-written psmatch2.^[12] Stata version 13 and later also offers the built-in command teffects psmatch.^[13]
SPSS: A dialog box for Propensity Score Matching is available from the IBM SPSS Statistics menu (Data/Propensity Score Matching), and allows the user to set the match tolerance, randomize case order when drawing samples, prioritize exact matches, sample with or without replacement, set a random seed, and maximize performance by increasing processing speed and minimizing memory usage.
Python: PsmPy, a library for propensity score matching in Python.

References

1. Rosenbaum, Paul R.; Rubin, Donald B. (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects". Biometrika. 70 (1): 41–55. doi:10.1093/biomet/70.1.41.
2. Pearl, J. (2000). Causality: Models, Reasoning, and Inference. New York: Cambridge University Press. ISBN 978-0-521-77362-1.
3. King, Gary; Nielsen, Richard (2019-05-07). "Why Propensity Scores Should Not Be Used for Matching". Political Analysis. 27 (4): 435–454. doi:10.1017/pan.2019.11. hdl:1721.1/128459. ISSN 1047-1987.
4. Garrido MM, et al. (2014). "Methods for Constructing and Assessing Propensity Scores". Health Services Research. 49 (5): 1701–20. doi:10.1111/1475-6773.12182. PMC 4213057. PMID 24779867.
5. Shadish, W. R.; Cook, T. D.; Campbell, D. T. (2002). Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin. ISBN 978-0-395-61556-0.
6. Pearl, J. (2009). "Understanding propensity scores". Causality: Models, Reasoning, and Inference (Second ed.). New York: Cambridge University Press. ISBN 978-0-521-89560-6.
7. Ho, Daniel; Imai, Kosuke; King, Gary; Stuart, Elizabeth (2007). "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference". Political Analysis. 15 (3): 199–236. doi:10.1093/pan/mpl013.
8. "MatchIt: Nonparametric Preprocessing for Parametric Causal Inference". R Project. 16 November 2022.
9. Hansen, Ben B; Klopfer, Stephanie Olsen (2006). "Optimal Full Matching and Related Designs via Network Flows". Journal of Computational and Graphical Statistics. Informa UK Limited. 15 (3): 609–627. doi:10.1198/106186006x137047. ISSN 1061-8600. S2CID 10138048.
10. Parsons, Lori. "Performing a 1:N Case-Control Match on Propensity Score" (PDF). SUGI 29: SAS Institute. Retrieved June 10, 2016.
11. Implementing Propensity Score Matching Estimators with STATA. Lecture notes 2001.
12. Leuven, E.; Sianesi, B. (2003). "PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing". Statistical Software Components.
13. "teffects psmatch — Propensity-score matching" (PDF). Stata Manual.

Bibliography

Abadie, Alberto; Imbens, Guido W. (2006). "Large Sample Properties of Matching Estimators for Average Treatment Effects". Econometrica. 74 (1): 235–267. CiteSeerX 10.1.1.559.6313. doi:10.1111/j.1468-0262.2006.00655.x.
Leite, Walter L. (2017). Practical Propensity Score Methods using R. Washington, DC: Sage Publications. ISBN 978-1-4522-8888-8.
Austin, Peter C. (31 May 2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". Multivariate Behavioral Research. 46 (3): 399–424. doi:10.1080/00273171.2011.568786. PMC 3144483. PMID 21818162.
