Beyond the P-Value: A Practical Guide to Understanding Advanced Statistics in Critical Care

Dr Neeraj Manikath, claude.ai

Abstract

Statistical literacy remains a fundamental competency for critical care physicians navigating an increasingly complex medical literature. While the limitations of p-values have been extensively debated, clinicians must also master the interpretation of confidence intervals, survival analyses, subgroup effects, confounding adjustment strategies, and emerging Bayesian methods. This review provides a practical framework for understanding these advanced statistical concepts, with emphasis on their clinical application, common pitfalls, and techniques to identify methodological misconduct. We present actionable pearls for the discerning reader to critically appraise quantitative evidence and avoid being misled by statistical manipulation.

Keywords: Critical care, statistics, confidence intervals, survival analysis, subgroup analysis, confounding, Bayesian statistics


Introduction

The p-value has long dominated medical statistics, serving as the gatekeeper between "significant" and "non-significant" findings. However, the American Statistical Association's 2016 statement emphasized that scientific conclusions should not be based solely on whether p crosses an arbitrary threshold[1]. For critical care physicians evaluating trials on sepsis bundles, ventilation strategies, or hemodynamic interventions, understanding what lies beyond the p-value is essential.

This review targets the practicing intensivist and critical care fellow, translating complex statistical concepts into clinically relevant interpretations. We focus on five domains frequently encountered but often misunderstood: confidence intervals, survival analysis, subgroup analyses, confounding adjustment, and Bayesian approaches.


Confidence Intervals: Quantifying Uncertainty

The Fundamental Concept

A confidence interval (CI) provides a range of plausible values for a treatment effect, typically at the 95% level. The correct interpretation: if we repeated the study infinitely under identical conditions, 95% of calculated CIs would contain the true population effect[2].

Pearl #1: The CI width reflects precision. A narrow CI (e.g., relative risk 0.75, 95% CI 0.71-0.79) indicates high precision, often from large sample sizes. A wide CI (e.g., relative risk 0.75, 95% CI 0.45-1.25) reveals substantial uncertainty.
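To see concretely how such intervals arise, here is a minimal Python sketch computing a 95% CI for a relative risk via the standard log-transform (delta-method) approach. The counts are hypothetical, chosen only to produce an RR near 0.75:

```python
import math

# Hypothetical 2x2 counts (illustrative, not from any cited trial)
a, n1 = 150, 1000   # treatment arm: events, total
c, n0 = 200, 1000   # control arm: events, total

rr = (a / n1) / (c / n0)

# Standard error of log(RR) by the delta method
se_log_rr = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)

lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI {lo:.2f}-{hi:.2f}")  # RR = 0.75, 95% CI 0.62-0.91
```

Doubling both sample sizes shrinks se_log_rr by a factor of √2 and narrows the interval accordingly—precision is bought with patients.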

Common Misinterpretations

Clinicians frequently misinterpret the CI as a probability statement about the true value: "There is a 95% probability the true effect lies within this range." This is incorrect under frequentist statistics—the true value is fixed; the CI either contains it or doesn't[3]. The 95% refers to the method's long-run reliability.

Clinical Application: The "Significance Fallacy"

Consider a trial comparing norepinephrine to vasopressin in septic shock with mortality RR = 0.88 (95% CI 0.77-1.01, p=0.06). Many would dismiss this as "negative" because p>0.05. However, the CI reveals that the true effect could range from a 23% risk reduction to a 1% risk increase. The point estimate suggests benefit; we simply lack precision to exclude the null[4].

Oyster #1: When the CI barely crosses 1.0 (or 0 for differences), don't immediately discard the intervention. Consider the plausible effect sizes and whether the lower bound still represents clinical benefit.

The Width Matters More Than You Think

In the ANDROMEDA-SHOCK trial evaluating peripheral perfusion-targeted versus lactate-targeted resuscitation in septic shock, the 28-day mortality difference was -8.5% (95% CI -18.2% to 1.2%; p=0.06)[5]. The point estimate suggests a clinically important benefit, yet the wide CI spans everything from a substantial absolute risk reduction to a small increase in mortality—crucial information for guideline development.

Hack #1: Always report and interpret the CI boundaries. Ask: "What is the worst plausible effect in this CI, and would I still use this intervention if that were true?"


Survival Analysis: Beyond the Numbers

Understanding Kaplan-Meier Curves

Survival analysis handles time-to-event data where patients are censored (lost to follow-up, withdrawn, or event-free at study end). The Kaplan-Meier (KM) curve displays the probability of event-free survival over time[6].

Pearl #2: The KM curve's "steps" occur at event times. The height represents cumulative survival probability. The tick marks indicate censored observations—their presence is vital for assessing data completeness.
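For readers who want to see these elements in practice, the following sketch plots a KM curve with censoring ticks and a number-at-risk table using the lifelines library (assumptions: lifelines and matplotlib are installed; the cohort is simulated, not real):

```python
import numpy as np
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts

rng = np.random.default_rng(42)

# Simulated ICU cohort: exponential event times, administrative censoring at day 28
true_times = rng.exponential(scale=30, size=200)
events = true_times <= 28                  # event observed within follow-up?
durations = np.minimum(true_times, 28)     # follow-up capped at 28 days

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events, label="28-day survival")

ax = kmf.plot_survival_function(show_censors=True)  # tick marks = censored patients
add_at_risk_counts(kmf, ax=ax)                      # the "number at risk" table
plt.show()
```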

The Hazard Ratio Demystified

The hazard ratio (HR) quantifies the instantaneous risk of an event in one group relative to another. HR = 0.70 means the intervention group has 30% lower hazard of the event at any given time, assuming proportional hazards[7].

Critical caveat: the HR is not a relative risk. An HR of 0.70 does not mean 30% fewer events will occur overall—it is a ratio of instantaneous event rates, not of cumulative event proportions. When events are rare the HR approximates the RR, but with common outcomes or long follow-up the two can diverge substantially[8].
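A minimal Cox model sketch with lifelines (toy data invented for illustration) shows where the HR comes from—the exp(coef) column of the fitted model:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Toy data: time = days to death or censoring, event = 1 if died, rx = 1 if treated
df = pd.DataFrame({
    "time":  [5, 8, 12, 20, 28, 28, 3, 9, 15, 28],
    "event": [1, 1, 1,  1,  0,  0,  1, 1, 1,  0],
    "rx":    [1, 1, 1,  1,  1,  1,  0, 0, 0,  0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()            # exp(coef) is the hazard ratio for rx
hr = cph.hazard_ratios_["rx"]  # HR < 1 favors the intervention
```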

Crossing Curves: When Assumptions Break Down

The Cox proportional hazards model assumes the HR remains constant over time—the "proportional hazards assumption." When KM curves cross, this assumption is violated[9].

Clinical example: Early thrombolysis in acute respiratory distress syndrome (ARDS) might increase short-term bleeding deaths (curve initially favors control) but reduce late fibrotic deaths (curves cross, favoring intervention later). A single HR obscures this temporal dynamic.

Oyster #2: Curve crossing suggests treatment effects change over time. The overall HR may be meaningless. Look for time-stratified analyses or landmark analyses separating early from late effects[10].
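Continuing the CoxPHFitter sketch above (with cph and df as fitted there), lifelines provides a Schoenfeld-residual-based check of the proportional hazards assumption; a small p-value for rx suggests the HR is not constant over time and a time-stratified or landmark analysis may be warranted:

```python
from lifelines.statistics import proportional_hazard_test

# Test whether scaled Schoenfeld residuals drift with time
result = proportional_hazard_test(cph, df, time_transform="rank")
result.print_summary()

# lifelines' built-in diagnostic prints advice on violations
cph.check_assumptions(df, p_value_threshold=0.05)
```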

When Curves Don't Cross

Conversely, curves that separate early and remain parallel support proportional hazards. Delayed separation suggests the intervention requires time to work (e.g., immunotherapy in cancer, potentially immunomodulation in sepsis).

Hack #2: Examine the "number at risk" table beneath KM curves. Rapid decline or imbalanced censoring between groups raises concerns about informative censoring or loss to follow-up that may bias results[11].

Log-Rank Test Limitations

The log-rank test compares entire survival distributions but gives equal weight to all time points. If most events occur early but the intervention prevents late events, the test may miss important effects. Weighted log-rank tests (e.g., Fleming-Harrington) can emphasize early or late divergence[12].
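Recent versions of lifelines expose both tests. The sketch below (toy data) contrasts the standard log-rank with a Fleming-Harrington weighting that up-weights late differences; the p=0, q=1 parameters are the FH weights and are assumed supported by the installed version:

```python
from lifelines.statistics import logrank_test

# Toy arm-level data: durations and event indicators
t_rx,  e_rx  = [3, 6, 9, 14, 21, 28, 28], [1, 1, 1, 1, 1, 0, 0]
t_ctl, e_ctl = [2, 4, 7, 10, 13, 19, 25], [1, 1, 1, 1, 1, 1, 1]

# Standard log-rank: equal weight at every event time
std = logrank_test(t_rx, t_ctl, event_observed_A=e_rx, event_observed_B=e_ctl)

# Fleming-Harrington (p=0, q=1): emphasizes LATE divergence
fh = logrank_test(t_rx, t_ctl, event_observed_A=e_rx, event_observed_B=e_ctl,
                  weightings="fleming-harrington", p=0, q=1)

print(std.p_value, fh.p_value)
```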


Subgroup Analyses: Fishing vs. Finding

The Multiple Comparisons Problem

Testing 20 subgroups at α=0.05 yields, on average, one false-positive finding by chance alone. This is the "multiple comparisons problem"[13]. Pharmaceutical companies and desperate researchers know that torturing data sufficiently will produce a "positive" subgroup.
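The arithmetic behind this claim is worth seeing once (assuming independent tests):

```python
alpha, k = 0.05, 20

expected_false_positives = k * alpha       # = 1.0 on average
p_at_least_one = 1 - (1 - alpha) ** k      # ≈ 0.64

print(expected_false_positives, round(p_at_least_one, 2))
```

In other words, with 20 subgroups there is roughly a 64% chance of at least one spuriously "significant" result.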

Identifying Legitimate Subgroups

Pre-specification is paramount. Credible subgroup analyses are:

  1. Pre-specified in the protocol with biological rationale
  2. Small in number (typically ≤5 subgroups)
  3. Tested with formal interaction tests, not separate p-values per subgroup[14]

Pearl #3: The correct question is not "Is the treatment significant in subgroup X?" but rather "Is the treatment effect different in subgroup X versus Y?" This requires an interaction test (p-interaction).

The Interaction Test

Consider the ACURASYS trial of neuromuscular blockade in ARDS. Suppose mortality reduction appeared larger in PaO₂/FiO₂ <100 versus 100-150. Separate p-values might be 0.02 and 0.15, tempting readers to conclude benefit only in severe hypoxemia. However, if p-interaction = 0.40, the apparent difference is likely chance[15].
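In code, the interaction test is simply the p-value on the product term in a regression model. A minimal statsmodels sketch on simulated data (not the ACURASYS data; here the true treatment effect is deliberately identical in both subgroups):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
treat = rng.integers(0, 2, n)
severe = rng.integers(0, 2, n)              # e.g., PaO2/FiO2 < 100

# Simulate a common treatment effect with NO true interaction
logit_p = -0.5 - 0.4 * treat + 0.6 * severe
death = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

df = pd.DataFrame({"death": death, "treat": treat, "severe": severe})
fit = smf.logit("death ~ treat * severe", data=df).fit(disp=0)

# 'treat:severe' answers: does the treatment effect DIFFER between subgroups?
print("p-interaction:", fit.pvalues["treat:severe"])
```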

Oyster #3: Authors often highlight subgroup analyses with "significant" effects without reporting interaction tests. This is a red flag for data dredging. Demand to see p-interaction values.

The "Credibility Checklist"

When evaluating subgroup claims, apply the Sun et al. criteria[16]:

  • Was it pre-specified?
  • Is there biological plausibility?
  • Was it one of few tested subgroups?
  • Does the p-interaction meet a stringent threshold (e.g., <0.01)?
  • Has it been replicated in other studies?

Hack #3: Be especially skeptical of subgroup findings in trials with neutral primary results. Post-hoc subgroup "successes" often represent statistical fishing to salvage failed studies.

The Danger of Over-Interpretation

The ISIS-2 trial of aspirin post-myocardial infarction famously reported that patients born under Gemini or Libra derived no benefit (indeed, slight harm), while those born under all other astrological signs benefited—a tongue-in-cheek demonstration of random variation in subgroups[17]. Yet similar biological implausibility doesn't stop authors from claiming, for instance, that a treatment works in men but not women without any mechanistic explanation.


Adjustment for Confounding: Regression and Propensity Matching

Why We Adjust

Confounding occurs when a third variable is associated with both the exposure and outcome, distorting the true effect. In observational critical care studies—comparing ICU protocols, ventilation strategies, or fluid management—confounding is omnipresent[18].

Multivariable Regression: The Workhorse

Multivariable regression adjusts for multiple confounders simultaneously, estimating the independent effect of an exposure on the outcome. In logistic regression for binary outcomes, this yields an adjusted odds ratio (aOR)[19].

Pearl #4: The adjusted estimate reflects the exposure-outcome association holding confounders constant. However, adjustment quality depends on measuring and including all relevant confounders—often impossible.
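A small simulation makes the mechanics visible: we build in confounding by severity (sicker patients are both more likely to be exposed and more likely to die), then compare crude and adjusted odds ratios with statsmodels. All variable names and numbers are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
apache = rng.normal(20, 6, n)                                       # severity score
p_exposed = 1 / (1 + np.exp(-(apache - 20) / 6))                    # sicker -> exposed
exposure = rng.binomial(1, p_exposed)
p_death = 1 / (1 + np.exp(-(-3 + 0.10 * apache + 0.2 * exposure)))  # true log-OR = 0.2
death = rng.binomial(1, p_death)

df = pd.DataFrame({"death": death, "exposure": exposure, "apache": apache})

crude = smf.logit("death ~ exposure", data=df).fit(disp=0)
adj   = smf.logit("death ~ exposure + apache", data=df).fit(disp=0)

print("crude OR:   ", np.exp(crude.params["exposure"]))  # inflated by confounding
print("adjusted OR:", np.exp(adj.params["exposure"]))    # near the true exp(0.2) ≈ 1.22
```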

What Regression Cannot Do

Residual confounding: Unmeasured confounders (e.g., frailty, illness severity nuances) remain uncontrolled.

Collider bias: Adjusting for variables caused by both exposure and outcome (colliders) introduces bias[20].

Functional form misspecification: Assuming linear relationships when reality is non-linear yields incorrect adjustments.

Oyster #4: An observational study adjusted for 50 variables is not equivalent to an RCT. Authors cannot adjust away selection bias or unmeasured confounding. Treat adjusted observational results as hypothesis-generating, not definitive.

Propensity Score Matching: Mimicking Randomization

Propensity score matching (PSM) attempts to balance groups by matching patients with similar probabilities of receiving treatment based on observed covariates[21]. Each patient gets a propensity score (predicted probability of exposure), and exposed/unexposed patients with similar scores are matched.

Advantages:

  • Reduces dimensionality (many covariates → single score)
  • Achieves balance on measured confounders
  • Transparent assessment of covariate balance

Limitations:

  • Only adjusts for measured confounders—unmeasured confounding persists
  • Requires sufficient overlap in propensity scores (common support); extreme scores are unmatched and excluded, limiting generalizability[22]
  • Multiple matching algorithms yield different results (greedy, optimal, caliper widths)—a "researcher degree of freedom"

Hack #4: Check the propensity score overlap histogram. Poor overlap (groups have non-overlapping score distributions) indicates the exposed and unexposed are fundamentally different populations—matching won't save the analysis[23].
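A minimal sketch of this check with scikit-learn and matplotlib (simulated covariates; the covariate names are placeholders, not a real dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 3))   # measured covariates (e.g., age, SOFA, lactate)
p_treat = 1 / (1 + np.exp(-(X @ np.array([1.0, 0.5, -0.5]))))
treated = rng.binomial(1, p_treat)

# Propensity score: predicted probability of treatment given covariates
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Hack #4 in code: inspect overlap BEFORE trusting any matched analysis
plt.hist(ps[treated == 1], bins=30, alpha=0.5, density=True, label="treated")
plt.hist(ps[treated == 0], bins=30, alpha=0.5, density=True, label="untreated")
plt.xlabel("propensity score")
plt.legend()
plt.show()
```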

Instrumental Variable Analysis

An advanced technique uses an "instrument"—a variable associated with the exposure but affecting the outcome only through that exposure—to estimate causal effects. Examples include Mendelian randomization using genetic variants and regional practice variation. These methods are beyond the scope of most clinical papers but appear increasingly in critical care health services research[24].

Pearl #5: No statistical adjustment method creates causality from observational data. Skepticism remains warranted even with sophisticated techniques.


Bayesian Statistics for the Clinician: A Paradigm Shift

The Frequentist Straitjacket

Traditional (frequentist) statistics answer: "If the null hypothesis were true, how often would we see data this extreme?" The p-value doesn't tell us what we want to know: "What is the probability the treatment works given our data?"

Bayesian statistics inverts this, directly estimating the probability of hypotheses given observed data[25].

The Bayesian Framework

Bayesian inference combines:

  1. Prior probability: Belief about treatment effect before seeing data
  2. Likelihood: How well data fit different effect sizes
  3. Posterior probability: Updated belief after seeing data

Formula: Posterior ∝ Likelihood × Prior

The posterior distribution provides probabilities for different effect sizes—directly answering clinical questions[26].

Clinical Example: Interpreting Bayesian Results

A Bayesian RCT of prone positioning in ARDS reports: "The posterior probability of any mortality reduction is 94%, and the probability of >5% absolute mortality reduction is 68%."

This directly quantifies evidence strength. Clinicians can ask: "What is the probability the NNT is <20?" and receive a probabilistic answer—far more intuitive than p-values and CIs[27].
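The mechanics of such statements can be reproduced with a conjugate beta-binomial model and Monte Carlo sampling. The counts below are invented for illustration, not taken from any trial:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical arm-level counts
deaths_rx, n_rx = 60, 250     # intervention arm
deaths_ct, n_ct = 80, 250     # control arm

# Uniform Beta(1,1) priors; conjugacy gives Beta posteriors for each arm
post_rx = rng.beta(1 + deaths_rx, 1 + n_rx - deaths_rx, 100_000)
post_ct = rng.beta(1 + deaths_ct, 1 + n_ct - deaths_ct, 100_000)

arr = post_ct - post_rx       # posterior draws of absolute risk reduction

print("P(any mortality reduction):      ", (arr > 0).mean())
print("P(ARR > 5%), i.e. P(NNT < 20):   ", (arr > 0.05).mean())
```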

Prior Selection: The Controversy

Critics argue prior selection is subjective. However:

  • Informative priors incorporate previous evidence (meta-analyses, pilot studies)
  • Skeptical priors assume small effects, requiring strong data to conclude benefit
  • Non-informative (vague) priors let data dominate

Pearl #6: Sensitivity analyses varying priors demonstrate result robustness. If conclusions change dramatically with different priors, evidence is weak regardless of approach[28].
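A compact way to run such a sensitivity analysis is a normal-normal conjugate update on the log-HR scale. The trial result below (HR 0.80, 95% CI 0.62-1.03) is hypothetical:

```python
import numpy as np
from scipy.stats import norm

log_hr = np.log(0.80)                              # observed log hazard ratio
se = (np.log(1.03) - np.log(0.62)) / (2 * 1.96)    # SE back-calculated from the CI

def p_benefit(prior_mean, prior_sd):
    """Posterior P(HR < 1) after a normal-normal conjugate update."""
    w_data, w_prior = 1 / se**2, 1 / prior_sd**2
    post_mean = (w_data * log_hr + w_prior * prior_mean) / (w_data + w_prior)
    post_sd = (w_data + w_prior) ** -0.5
    return norm.cdf(0, loc=post_mean, scale=post_sd)

print("vague prior (sd = 10):     ", round(p_benefit(0.0, 10.0), 3))  # ~0.96
print("skeptical prior (sd = 0.1):", round(p_benefit(0.0, 0.10), 3))  # ~0.85
```

If the two runs disagreed wildly, the honest conclusion would be that the data alone cannot settle the question.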

Practical Advantages in Critical Care

  1. Interim analyses without penalty: Frequentist multiple testing inflates type I error; Bayesian updating doesn't require adjustment
  2. Small sample studies: Bayesian methods handle sparse data better, providing probability distributions rather than unstable p-values
  3. Stopping rules: Trials can stop for high posterior probability of benefit or futility
  4. Incorporation of external data: Priors formalize use of existing evidence

Oyster #5: Bayesian trials require pre-specified priors and decision thresholds. Post-hoc prior selection to favor desired conclusions is as problematic as p-hacking[29].

Interpreting Posterior Probabilities

Consider a sepsis trial reporting: "Posterior probability of HR <1.0 is 89%." This means 89% of the posterior distribution indicates benefit. Whether this justifies practice change depends on thresholds and clinical context—no magical cutoff replaces judgment[30].

Hack #5: Examine the entire posterior distribution, not just summary probabilities. A 90% probability of benefit sounds convincing, but if 80% of that distribution indicates HR 0.95-0.99 (minimal effect), enthusiasm should be tempered.

Bayesian vs. Frequentist: Complementary, Not Competing

Both frameworks answer different questions. Bayesian methods suit decision-making under uncertainty; frequentist methods control long-run error rates. The critical care literature increasingly includes both—learn to read each on its own terms[31].


Practical Pearls and Hacks: A Summary

Pearl #1: CI width indicates precision; don't ignore wide CIs that cross the null.

Pearl #2: KM curve tick marks and "number at risk" tables reveal censoring patterns.

Pearl #3: Subgroup analyses require interaction tests, not separate p-values.

Pearl #4: Regression adjusts for measured confounders only—residual confounding persists.

Pearl #5: No statistical method creates causality from observational data.

Pearl #6: Bayesian sensitivity analyses with varying priors demonstrate evidence robustness.

Hack #1: Ask, "Would I use this intervention if the true effect were at the CI's lower bound?"

Hack #2: Examine KM curve "number at risk" tables for informative censoring.

Hack #3: Skeptically appraise subgroups in neutral trials—likely fishing expeditions.

Hack #4: Check propensity score overlap histograms for fundamental group differences.

Hack #5: Examine full Bayesian posterior distributions, not just summary probabilities.


Conclusion

Statistical sophistication separates the discerning critical care clinician from the passive consumer of medical literature. P-values alone provide insufficient evidence for clinical decision-making. Confidence intervals quantify uncertainty, survival analyses capture temporal treatment dynamics, subgroup analyses demand rigorous pre-specification and interaction testing, confounding adjustment has inherent limitations, and Bayesian methods offer intuitive probabilistic inference.

The intensivist armed with these tools can identify methodological flaws, resist statistical manipulation, and synthesize quantitative evidence appropriately. As critical care evolves toward personalized medicine and adaptive trial designs, statistical literacy becomes not merely academic but essential to optimal patient care.

The next time you encounter a "significant" p-value, pause. Look beyond. Ask about confidence intervals, examine survival curves, demand interaction tests for subgroups, scrutinize adjustment strategies, and consider Bayesian interpretations. Your patients—and the integrity of medical science—deserve nothing less.


References

  1. Wasserstein RL, Lazar NA. The ASA statement on p-values: context, process, and purpose. Am Stat. 2016;70(2):129-133.

  2. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337-350.

  3. Morey RD, Hoekstra R, Rouder JN, Lee MD, Wagenmakers EJ. The fallacy of placing confidence in confidence intervals. Psychon Bull Rev. 2016;23(1):103-123.

  4. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567(7748):305-307.

  5. Hernández G, Ospina-Tascón GA, Damiani LP, et al. Effect of a resuscitation strategy targeting peripheral perfusion status vs serum lactate levels on 28-day mortality among patients with septic shock: the ANDROMEDA-SHOCK randomized clinical trial. JAMA. 2019;321(7):654-664.

  6. Rich JT, Neely JG, Paniello RC, Voelker CC, Nussenbaum B, Wang EW. A practical guide to understanding Kaplan-Meier curves. Otolaryngol Head Neck Surg. 2010;143(3):331-336.

  7. Spruance SL, Reid JE, Grace M, Samore M. Hazard ratio in clinical trials. Antimicrob Agents Chemother. 2004;48(8):2787-2792.

  8. Hernán MA. The hazards of hazard ratios. Epidemiology. 2010;21(1):13-15.

  9. Bellera CA, MacGrogan G, Debled M, de Lara CT, Brouste V, Mathoulin-Pélissier S. Variables with time-varying effects and the Cox model: some statistical concepts illustrated with a prognostic factor study in breast cancer. BMC Med Res Methodol. 2010;10:20.

  10. Dafni U. Landmark analysis at the 25-year landmark point. Circ Cardiovasc Qual Outcomes. 2011;4(3):363-371.

  11. Altman DG, De Stavola BL, Love SB, Stepniewska KA. Review of survival analyses published in cancer journals. Br J Cancer. 1995;72(2):511-518.

  12. Harrington DP, Fleming TR. A class of rank test procedures for censored survival data. Biometrika. 1982;69(3):553-566.

  13. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1(1):43-46.

  14. Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine—reporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357(21):2189-2194.

  15. Papazian L, Forel JM, Gacouin A, et al. Neuromuscular blockers in early acute respiratory distress syndrome. N Engl J Med. 2010;363(12):1107-1116.

  16. Sun X, Briel M, Walter SD, Guyatt GH. Is a subgroup effect believable? Updating criteria to evaluate the credibility of subgroup analyses. BMJ. 2010;340:c117.

  17. ISIS-2 (Second International Study of Infarct Survival) Collaborative Group. Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction: ISIS-2. Lancet. 1988;2(8607):349-360.

  18. Lederer DJ, Bell SC, Branson RD, et al. Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals. Ann Am Thorac Soc. 2019;16(1):22-28.

  19. Localio AR, Margolis DJ, Berlin JA. Relative risks and confidence intervals were easily computed indirectly from multivariable logistic regression. J Clin Epidemiol. 2007;60(9):874-882.

  20. Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology. 2004;15(5):615-625.

  21. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41-55.

  22. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behav Res. 2011;46(3):399-424.

  23. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Sci. 2010;25(1):1-21.

  24. Swanson SA, Hernán MA. The challenging interpretation of instrumental variable estimates under monotonicity. Int J Epidemiol. 2018;47(4):1289-1297.

  25. Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999;130(12):995-1004.

  26. Kruschke JK, Liddell TM. The Bayesian New Statistics: hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon Bull Rev. 2018;25(1):178-206.

  27. Ryan EG, Harrison EM, Pearse RM, Gates S. Perioperative haemodynamic therapy for major gastrointestinal surgery: the effect of a Bayesian approach to interpreting the findings of a randomised controlled trial. BMJ Open. 2019;9(3):e024256.

  28. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Chichester: John Wiley & Sons; 2004.

  29. Berry SM, Carlin BP, Lee JJ, Muller P. Bayesian Adaptive Methods for Clinical Trials. Boca Raton: CRC Press; 2010.

  30. Lewis RJ, Angus DC. Time for clinicians to embrace their inner Bayesian? Reanalysis of results of a clinical trial of extracorporeal membrane oxygenation. JAMA. 2018;320(21):2208-2210.

  31. Goligher EC, Tomlinson G, Hajage D, et al. Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome and posterior probability of mortality benefit in a post hoc Bayesian analysis of a randomized clinical trial. JAMA. 2018;320(21):2251-2259.
