Human judgment error in the intensive care unit: a perspective on bias and noise
Critical Care volume 29, Article number: 86 (2025)
Abstract
Background
In the Intensive Care Unit (ICU), clinicians frequently make complex, high-stakes judgments, where inaccuracies can profoundly affect patient outcomes. This perspective examines human judgment error in ICU settings, specifically bias (systematic error) and noise (random error). While past research has emphasized bias, we explore the role of noise in clinical decision making and its mitigation.
Main body
System noise refers to unwanted variability in judgments that should ideally be identical. This variability stems from level noise (variability in clinicians’ average judgments), stable pattern noise (variability in clinicians’ responses to specific patient characteristics), and occasion noise (random, within-clinician variability). Two strategies to reduce noise are the use of algorithms and the averaging of independent judgments.
Conclusion
Recognizing and addressing noise in clinical decision making is essential to enhancing judgment accuracy in critical care. By implementing effective noise reduction strategies, clinicians can reduce errors and improve patient outcomes, ultimately advancing the quality of care delivered in ICU settings.
Introduction
In the intensive care unit (ICU), clinicians frequently make high-stakes decisions that rely heavily on human judgment. Any errors in judgment can have profound consequences for patient outcomes, underscoring the need to minimize them. Judgment error, defined as the difference between a clinician’s prediction and the actual clinical outcome, consists of two distinct components: bias and noise [1].
Bias refers to systematic error, often caused by cognitive biases [2], such as anchoring or availability bias [3], but also includes discriminatory biases against various social groups [4]. For instance, pain assessments may be consistently underestimated for taller individuals compared to shorter ones. However, when specific biases are not shared equally among clinicians, they contribute to random variability or noise. Noise represents random error [1]. For example, noise occurs when different clinicians make varying judgments for the same patient, a phenomenon widely observed across various medical fields [5,6,7].
Recent medical research has brought critical attention to cognitive biases in human judgment [2, 8] and discriminatory biases in artificial intelligence (AI) models [9,10,11]. However, improving judgment accuracy in the ICU requires addressing not only bias—whether in humans or AI—but also the often-overlooked issue of human judgment noise. This is crucial because reducing either noise or bias by the same magnitude results in an equivalent reduction in overall error, although the strategies for mitigating them differ [1]. Accordingly, this perspective explores different types of human judgment noise, as shown in Fig. 1, and proposes two strategies for its reduction: using algorithms and averaging independent judgments.
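The equivalence of bias and noise reduction follows from the decomposition of mean squared error into squared bias plus noise variance. The following minimal sketch (in Python, with assumed illustrative numbers rather than clinical data) verifies this decomposition numerically:

```python
import numpy as np

# Minimal sketch of the error decomposition described in [1]: for repeated
# judgments of a known outcome, MSE = bias^2 + noise^2 (variance).
# All values below are assumed for illustration, not clinical data.
rng = np.random.default_rng(3)
truth = 40.0                                           # true mortality risk (%)
judgments = truth + 5.0 + rng.normal(0, 10, 100_000)   # bias = 5, noise SD = 10

mse = np.mean((judgments - truth) ** 2)
bias = judgments.mean() - truth
noise_var = judgments.var()

print(round(mse, 1), round(bias**2 + noise_var, 1))    # both ~125 = 25 + 100
```

Because the two components enter the error symmetrically, removing five points of noise is worth exactly as much as removing five points of bias.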
Types of human judgment noise
System noise. The ICU operates as a system designed to manage a diverse range of patients, implicitly assuming that clinicians are interchangeable in their decision making. However, this assumption is often undermined by disagreements among clinicians evaluating the same patients. This is called system noise: unwanted variability in judgments that should ideally be identical [1]. For example, in a recent pilot study (see Note 1), 20 ICU clinicians were asked to estimate the probability of mortality (on a scale of 0% to 100%) for the same patient cases. As the second column of Table 1 demonstrates, there is variability in their estimates for the same patients, that is, system noise. System noise can be further divided into two components that explain this variability: level noise and pattern noise.
Level noise. Level noise refers to the variability in the average judgment of individuals [1]. For example, some clinicians may consistently favor surgical intervention, while others may be less inclined, regardless of patient specifics. Level noise is also evident in the last column of Table 1; clinicians differ in their average mortality estimates across cases. These differences in individual judgment averages contribute to system noise. However, they do not fully explain the variability in assessments for specific patients.
Stable pattern noise. Pattern noise refers to variability between clinicians that goes beyond level noise, arising from their differing responses to specific patient characteristics. Pattern noise can be divided into stable pattern noise and occasion noise. Stable pattern noise reflects stable patterns or interaction effects between a clinician and patient traits [1]. For instance, Clinician A might consistently assign higher mortality probabilities to patients with specific comorbidities compared to both their own average estimate and those of colleagues for similar cases (see Note 2). This variability in judgments does not stem from clinicians being more optimistic or pessimistic overall, but from the distinct ways in which they weigh different patient factors [4].
Occasion noise. While level noise and stable pattern noise reflect systematic between-clinician variability, occasion noise captures random, within-clinician variability. Occasion noise occurs when the same clinician makes different judgments about identical cases depending on the occasion. Common sources of occasion noise include mood, weather, fatigue, and time of day [1]. Additionally, the second-to-last column of Table 1 displays occasion noise: when patient case 3 was unknowingly repeated as case 51, inconsistencies emerged within the clinicians' own judgments.
In summary, system noise stems from three different sources: level noise, stable pattern noise, and occasion noise. This variability undermines the reliability of clinical decision making, as this process should ideally remain consistent regardless of the clinician or random situational factors. The presence of system noise means that similarly situated patients may receive different mortality predictions and subsequent treatment simply based on which clinician assesses them, introducing an element of a "lottery" effect in the system [4]. Without deliberate strategies to mitigate human judgment noise, these random errors persist, compounding across individual patients rather than canceling each other out, ultimately impacting patient care and outcomes.
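This decomposition can be made concrete with a short sketch. The Python code below estimates the components from a clinician-by-case matrix of mortality judgments; the matrix here is simulated for illustration, not the pilot-study data, and separating stable pattern noise from occasion noise would additionally require repeated presentations of the same case, as in Table 1:

```python
import numpy as np

# Illustrative sketch: decompose system noise into level noise and pattern
# noise from a clinician-by-case judgment matrix (simulated values).
rng = np.random.default_rng(0)
judgments = rng.normal(50, 10, size=(20, 10))   # 20 clinicians x 10 cases

case_means = judgments.mean(axis=0)             # consensus per case
clinician_means = judgments.mean(axis=1)        # each clinician's level
grand_mean = judgments.mean()

# Level noise: variability of clinicians' average judgments.
level_var = clinician_means.var()

# Pattern noise: clinician-by-case interaction remaining after removing
# the case effect and each clinician's level.
residual = judgments - case_means - clinician_means[:, None] + grand_mean
pattern_var = np.mean(residual ** 2)

# System noise: variability across clinicians judging the identical case.
system_var = np.mean((judgments - case_means) ** 2)

# The variances add: system noise^2 = level noise^2 + pattern noise^2.
print(round(system_var, 2), round(level_var + pattern_var, 2))
```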
Reducing human judgment noise
Bias and noise together contribute to judgment error [1]. In clinical practice, recognizing these errors and knowing how to reduce them is essential for professional health care. In the ICU context, two potential strategies for reducing noise are: using algorithms (including AI models) and averaging independent judgments.
Algorithms. A wide range of judgments and decisions are made in the ICU, and estimating mortality is one example that highlights the challenges involved. Estimating mortality is particularly complex due to the highly variable nature of ICU patients. This variability poses significant challenges to the development of intuitive expertise, which relies on several key conditions: clear, timely feedback; stable relationships between predictors and outcomes; and the opportunity to learn these relationships [12]. These requirements may not always be fulfilled in the ICU considering, for example, the results of Cox et al. [13], who found that experienced physicians were not significantly more accurate than medical students in predicting ICU mortality. Especially in these situations, algorithms offer a promising solution for improving judgment accuracy [14,15,16,17].
A major advantage of algorithms is that they reduce noise in several ways. First, well-designed algorithms apply appropriate weights to predictive variables and ignore those without predictive validity. In contrast, humans generally struggle to effectively ignore irrelevant variables [18, 19], contributing to noise. Second, and potentially more importantly, algorithms consistently apply the same weighting of variables across all cases. This is not true for human judges, who vary their weighting across cases, often in ways that introduce substantial noise rather than enhancing accuracy [20]. The importance of consistency is highlighted by research showing that linear models representing experts’ judgment policies outperform the experts they were derived from [20] and that even “improper” models—such as those with unit or random weights—can outperform experts due to their consistency [21, 22]. The sketch below illustrates this consistency advantage.
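To show why consistency alone matters, the following stylized simulation (all parameters assumed) compares a fixed-weight linear model with a judge who applies the same average policy but jitters the weights from case to case:

```python
import numpy as np

# Stylized sketch (all parameters assumed): a model with fixed weights versus
# a judge who uses the same average policy but varies it from case to case.
rng = np.random.default_rng(1)
n_cases, n_pred = 10_000, 5
X = rng.normal(size=(n_cases, n_pred))
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])    # last two predictors invalid
outcome = X @ true_w + rng.normal(0, 0.5, n_cases)

fixed_w = np.full(n_pred, 0.2)                  # "improper" equal weights [21, 22]
model_pred = X @ fixed_w                        # same weights on every case

jitter = rng.normal(0, 0.3, size=(n_cases, n_pred))
expert_pred = np.sum(X * (fixed_w + jitter), axis=1)  # noisy application

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print(f"consistent model r  = {corr(model_pred, outcome):.2f}")   # ~0.56
print(f"inconsistent judge r = {corr(expert_pred, outcome):.2f}")  # ~0.31
```

Both judges share the same average policy; the only difference is that one applies it without noise.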
Translating this to mortality estimates in the ICU, the Acute Physiology and Chronic Health Evaluation (APACHE) score may serve as an illustrative algorithm [23, 24]. The APACHE scoring system standardizes data collection and combination to assess patients' disease severity and mortality probability. It ensures that clinicians (a) collect the same valid variables, and (b) use a consistent algorithm to combine these data. The APACHE score thereby enables multiple clinicians to generate the same score for identical patients, minimizing system noise (except for potential inaccuracies in the measurement instruments or inconsistencies in judgment-based variables, such as the Glasgow Coma Scale) and improving judgment accuracy compared to non-standardized assessments.
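The structural point can be sketched schematically. The point bands below are placeholders, not the published APACHE tables; what matters is that the mapping from measurements to a score is fixed, so identical inputs always yield identical scores:

```python
# Schematic sketch only: the cut-offs and point values are hypothetical
# placeholders, not the published APACHE tables. The structural point is
# that a fixed rule maps the same inputs to the same score every time,
# eliminating system noise from the combination step.
def toy_severity_score(heart_rate: float, mean_ap: float, gcs: int) -> int:
    score = 0
    if heart_rate >= 140 or heart_rate < 50:   # hypothetical bands
        score += 3
    elif heart_rate >= 110:
        score += 2
    if mean_ap < 60 or mean_ap >= 130:         # hypothetical bands
        score += 3
    score += 15 - gcs                          # lower consciousness, more points
    return score

# Two clinicians entering the same data necessarily get the same score.
assert toy_severity_score(120, 65, 12) == toy_severity_score(120, 65, 12)
print(toy_severity_score(120, 65, 12))         # -> 5
```

Whether this noise-free combination also improves accuracy is an empirical question.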
This expectation is supported by data from Cox et al. [13]. In their field study of 827 patients, Cox et al. found that physicians' predictions of in-hospital mortality had an Area Under the Curve (AUC) of 0.68 (95% CI 0.63–0.73). In comparison, our analysis of the same dataset yielded an AUC of 0.83 (95% CI 0.79–0.88) for the APACHE IV score, a significant improvement. While acknowledging an imbalance in available information between physicians and the APACHE IV, the key takeaway is that algorithmic approaches eliminate human judgment noise, underscoring the advantage of such methods.
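The direction of this difference is what one would expect from judgment noise alone. A simulation sketch (assumed parameters, not the Cox et al. [13] dataset) shows how adding random noise to an otherwise identical risk signal lowers discrimination:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Simulated illustration (assumed parameters, not the Cox et al. [13] data):
# the same underlying risk signal, scored once consistently and once with
# added judgment noise. The noise alone lowers the AUC.
rng = np.random.default_rng(7)
n = 50_000
risk = rng.normal(-1.5, 1.2, n)                     # latent severity (logit)
outcome = rng.binomial(1, 1 / (1 + np.exp(-risk)))  # in-hospital mortality

consistent_score = risk                             # algorithm: no judgment noise
noisy_score = risk + rng.normal(0, 1.2, n)          # same signal + human noise

print("AUC, consistent:", round(roc_auc_score(outcome, consistent_score), 2))
print("AUC, noisy:     ", round(roc_auc_score(outcome, noisy_score), 2))
```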
Averaging independent judgments. A second method for reducing noise is averaging independent or inter-individual judgments, commonly referred to as the “wisdom of the crowd” principle [25]. This approach works because, statistically, averaging independent judgments reduces noise by a factor equal to the square root of the number of judgments averaged. The underlying reason is that independent judgments are subject to random errors, and these errors tend to cancel each other out when averaged, leading to a more accurate overall estimate (see the simulation sketch below). However, it is important to note that while averaging reduces noise, it does not address bias [1]. Notably, combining intra-individual judgments—such as averaging multiple judgments made by the same clinician over time—can also enhance accuracy, but only if these judgments are non-redundant, meaning they are based on different perspectives or pieces of information [26].
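A minimal simulation (assumed numbers) makes the square-root effect visible:

```python
import numpy as np

# Minimal sketch (assumed values): each clinician's estimate is the true risk
# plus independent random error. Averaging n independent estimates shrinks
# the error SD by a factor of sqrt(n); it does not remove shared bias.
rng = np.random.default_rng(42)
truth, noise_sd, trials = 60.0, 15.0, 100_000

for n in (1, 4, 16):
    estimates = truth + rng.normal(0, noise_sd, size=(trials, n))
    averaged = estimates.mean(axis=1)
    print(f"n={n:2d}  error SD = {averaged.std():.2f}"
          f"  (theory: {noise_sd / np.sqrt(n):.2f})")
```

Averaging 4 independent judgments halves the noise; averaging 16 quarters it. If all clinicians share the same bias, however, the averaged estimate inherits that bias in full.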
Discussion
This perspective focused on human judgment error, discussing two sources of error: bias and noise. We focused on noise and explained that system noise (i.e., unwanted variability in judgments for identical patients) stems from level, stable pattern, and occasion noise. Additionally, we discussed two ways—using algorithms and averaging independent judgments—to reduce noise.
While we are optimistic about the potential of algorithms, several issues need to be addressed to ensure that their implementation improves patient outcomes. First, the performance of algorithms must be critically examined, as their effectiveness cannot be assumed based on training data performance alone. Providing clinicians with outputs from incorrect models can decrease their judgment accuracy [27]. Moreover, training models on noisy, human-labeled data—such as diagnostic labels derived from inconsistent expert judgments—can introduce errors into the models [28]. Beyond human error, inherent algorithmic challenges, such as hallucinations, inconsistency, unfaithful outputs, poor handling of data scarcity, and difficulty managing complexity, can undermine their performance. Second, algorithms can have unintended consequences, such as reinforcing social inequalities if they are discriminatory [4, 9,10,11]. These risks highlight the importance of responsible algorithm use to ensure they achieve their intended clinical benefits.
Clinicians will not mindlessly apply an algorithm’s output without additional interpretation. Clinicians synthesize an algorithm’s output as one cue within their broader judgment process, a process known as holistic synthesis [29, 30]. Holistic synthesis is often seen as advantageous, as clinicians may notice factors not incorporated into the model [17, 31]. However, whether it is indeed advantageous requires investigation, as an important and counterintuitive finding from research in other fields is that human judges often identify too many exceptions to algorithms, introducing noise and ultimately reducing accuracy [17, 30, 32,33,34,35]. Future research could explore whether presenting an algorithm’s output as one of several cues for holistic synthesis enhances or reduces judgmental accuracy compared to its unaltered application. If holistic synthesis proves beneficial, identifying which aspects of clinicians’ judgments add value could inform model refinement, ensuring these elements are reliably incorporated.
Beyond synthesizing an algorithm’s output with clinical judgment, sometimes an algorithm's output is ignored [36]. A contributing factor to this underutilization is the lack of transparency in underlying models or the definitions of predictors [36,37,38], which causes distrust in the validity of the algorithm. This is understandable; however, it is important to remember that human decision making is also a ‘black box.’ Research has shown that it is often difficult for individuals, including clinicians, to articulate their decision-making processes [39]. This makes improving mental judgments challenging. Despite the opacity of some algorithms, they can still be rigorously examined and their behavior experimentally tested—something far more difficult to achieve with human decision-makers [40].
We also acknowledge that making completely independent judgments is not always feasible in real-world ICU settings. However, it is crucial to recognize that independence plays a fundamental role in reducing noise, as it helps to minimize the influence of individual biases and idiosyncratic tendencies. When independence is compromised, for instance through social influence or group dynamics, the effectiveness of noise reduction through averaging is substantially undermined [1]. To address this, systems could be designed to enable clinicians to make independent assessments and systematically aggregate them.
For critical care clinicians, recognizing judgment errors as a combination of bias and noise is crucial. This understanding underscores the importance of adopting evidence-based interventions to minimize errors. By integrating these strategies into ICU protocols, clinicians can improve the accuracy of high-stakes judgments and decisions, ultimately enhancing patient outcomes and advancing the quality of care in ICU settings.
Availability of data and materials
The pilot study data will be made available upon reasonable request by contacting the corresponding author.
Notes
1. The methods of this pilot study can be found on the Open Science Framework.
2. Stable pattern noise is not shown in Table 1, as its assessment involves calculating interaction effects; see [1] for a detailed illustration.
References
Kahneman D, Sibony O, Sunstein CR. Noise: a flaw in human judgment. Hachette UK; 2021.
Beldhuis IE, Marapin RS, Jiang YY, de Souza NFS, Georgiou A, Kaufmann T, et al. Cognitive biases, environmental, patient and personal factors associated with critical care decision making: a scoping review. J Crit Care. 2021;64:144–53. https://doi.org/10.1016/j.jcrc.2021.04.012.
Kahneman D. Thinking, fast and slow. Farrar, Straus and Giroux; 2011.
Sunstein CR. Governing by algorithm? No noise and (potentially) less bias. Duke LJ. 2021;71:1175.
Fierro JL, Prasad PA, Localio AR, Grundmeier RW, Wasserman RC, Zaoutis TE, Gerber JS. Variability in the diagnosis and treatment of group a streptococcal pharyngitis by primary care pediatricians. Infect Control Hosp Epidemiol. 2014;35(S3):S79–85. https://doi.org/10.1017/S0899823X00194036.
Palsson R, Colona MR, Hoenig MP, Lundquist AL, Novak JE, Perazella MA, Waikar SS. Assessment of interobserver reliability of nephrologist examination of urine sediment. JAMA Netw Open. 2020;3(8):e2013959. https://doi.org/10.1001/jamanetworkopen.2020.13959.
Robinson P. Radiology’s Achilles’ heel: error and variation in the interpretation of the Röntgen image. Br J Radiol. 1997;70(839):1085–98. https://doi.org/10.1259/bjr.70.839.9536897.
Whelehan DF, Conlon KC, Ridgway PF. Medicine and heuristics: cognitive biases and medical decision-making. Ir J Med Sci. 2020;189:1477–84. https://doi.org/10.1007/s11845-020-02235-1.
Koçak B, Ponsiglione A, Stanzione A, Bluethgen C, Santinha J, Ugga L, et al. Bias in artificial intelligence for medical imaging: fundamentals, detection, avoidance, mitigation, challenges, ethics, and prospects. Diagn Interv Radiol. 2024; Epub ahead of print. https://doi.org/10.5167/uzh-264698.
Matos J, Gallifant J, Chowdhury A, Economou-Zavlanos N, Charpignon M-L, Gichoya J, et al. A Clinician’s guide to understanding bias in critical clinical prediction models. Crit Care Clin. 2024;40(4):827–57. https://doi.org/10.1016/j.ccc.2024.05.011.
Mittermaier M, Raza MM, Kvedar JC. Bias in AI-based models for medical applications: challenges and mitigation strategies. NPJ Digital Med. 2023;6(1):113. https://doi.org/10.1038/s41746-023-00858-z.
Kahneman D, Klein G. Conditions for intuitive expertise: a failure to disagree. Am Psychol. 2009;64(6):515. https://doi.org/10.1037/a0016755.
Cox EG, Onrust M, Vos ME, Paans W, Dieperink W, Koeze J, et al. The simple observational critical care studies: estimations by students, nurses, and physicians of in-hospital and 6-month mortality. Crit Care. 2021;25:1–8. https://doi.org/10.1186/s13054-021-03809-w.
Ægisdóttir S, White MJ, Spengler PM, Maugherman AS, Anderson LA, Cook RS, et al. The meta-analysis of clinical judgment project: Fifty-six years of accumulated research on clinical versus statistical prediction. Couns Psychol. 2006;34(3):341–82. https://doi.org/10.1177/0011000005285875.
Grove WM, Zald DH, Lebow BS, Snitz BE, Nelson C. Clinical versus mechanical prediction: a meta-analysis. Psychol Assess. 2000;12(1):19. https://doi.org/10.1037/1040-3590.12.1.19.
Kuncel NR, Klieger DM, Connelly BS, Ones DS. Mechanical versus clinical data combination in selection and admissions decisions: a meta-analysis. J Appl Psychol. 2013;98(6):1060. https://doi.org/10.1037/a0034156.
Meehl PE. Clinical versus statistical prediction: a theoretical analysis and a review of the evidence. University of Minnesota Press; 1954. https://doi.org/10.1037/11281-000.
Dana J, Dawes R, Peterson N. Belief in the unstructured interview: the persistence of an illusion. Judgm Decis Mak. 2013;8(5):512–20. https://doi.org/10.1017/S1930297500003612.
Kemmelmeier M. Separating the wheat from the chaff: Does discriminating between diagnostic and nondiagnostic information eliminate the dilution effect? J Behav Decis Mak. 2004;17(3):231–43. https://doi.org/10.1002/bdm.473.
Karelaia N, Hogarth RM. Determinants of linear judgment: a meta-analysis of lens model studies. Psychol Bull. 2008;134(3):404. https://doi.org/10.1037/0033-2909.134.3.404.
Dawes RM. The robust beauty of improper linear models in decision making. Am Psychol. 1979;34(7):571–82. https://doi.org/10.1037/0003-066X.34.7.571.
Yu MC, Kuncel NR. Pushing the limits for judgmental consistency: Comparing random weighting schemes with expert judgments. Person Assess Decis. 2020;6(2):2.
Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818–29.
Zimmerman JE, Kramer AA, McNair DS, Malila FM. Acute physiology and chronic health evaluation (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit Care Med. 2006;34(5):1297–310. https://doi.org/10.1097/01.CCM.0000215112.84523.F0.
Surowiecki J. The wisdom of crowds. Anchor; 2005.
Herzog SM, Hertwig R. Think twice and then: combining or choosing in dialectical bootstrapping? J Exp Psychol Learn Mem Cogn. 2014;40(1):218. https://doi.org/10.1037/a0034054.
Jabbour S, Fouhey D, Shepard S, Valley TS, Kazerooni EA, Banovic N, et al. Measuring the impact of AI in the diagnosis of hospitalized patients: a randomized clinical vignette survey study. JAMA. 2023;330(23):2275–84. https://doi.org/10.1001/jama.2023.22295.
Sylolypavan A, Sleeman D, Wu H, Sim M. The impact of inconsistent human annotations on AI driven clinical decision making. NPJ Digital Med. 2023;6(1):26. https://doi.org/10.1038/s41746-023-00773-3.
Meijer RR, Neumann M, Hemker BT, Niessen ASM. A tutorial on mechanical decision-making for personnel and educational selection. Front Psychol. 2020. https://doi.org/10.3389/fpsyg.2019.03002.
Sawyer J. Measurement and prediction, clinical and statistical. Psychol Bull. 1966;66(3):178. https://doi.org/10.1037/h0023624.
Sniderman AD, D’Agostino RB Sr, Pencina MJ. The role of physicians in the era of predictive analytics. JAMA. 2015;314(1):25–6. https://doi.org/10.1001/jama.2015.6177.
Camerer CF, Johnson EJ. The process-performance paradox in expert judgment: how can experts know so much and predict so badly? In: Ericsson KA, Smith J, editors. Toward a general theory of expertise: prospects and limits. Cambridge University Press; 1991. p. 195–217.
Hoffman M, Kahn LB, Li D. Discretion in hiring. Q J Econ. 2018;133(2):765–800. https://doi.org/10.1093/qje/qjx042.
Neumann M, Hengeveld M, Niessen ASM, Tendeiro JN, Meijer RR. Education increases decision-rule use: an investigation of education and incentives to improve decision making. J Exp Psychol Appl. 2022;28(1):166. https://doi.org/10.1037/xap0000372.
Neumann M, Niessen ASM, Linde M, Tendeiro JN, Meijer RR. “Adding an egg” in algorithmic decision making: improving stakeholder and user perceptions, and predictive validity by enhancing autonomy. Eur J Work Organ Psy. 2024;33(3):245–62. https://doi.org/10.1080/1359432X.2023.2260540.
Hill A, Morrissey D, Marsh W. What characteristics of clinical decision support system implementations lead to adoption for regular use? A scoping review. BMJ Health Care Inform. 2024. https://doi.org/10.1136/bmjhci-2024-101046.
Cox EG, Meijs DA, Wynants L, Sels JWE, Koeze J, Keus F, et al. The definition of predictor and outcome variables in mortality prediction models: a scoping review and quality of reporting study. J Clin Epidemiol. 2024. https://doi.org/10.1016/j.jclinepi.2024.111605.
Pinsky MR, Bedoya A, Bihorac A, Celi L, Churpek M, Economou-Zavlanos NJ, et al. Use of artificial intelligence in critical care: opportunities and obstacles. Crit Care. 2024;28(1):113. https://doi.org/10.1186/s13054-024-04860-z.
Nisbett RE, Wilson TD. Telling more than we can know: Verbal reports on mental processes. Psychol Rev. 1977;84(3):231. https://doi.org/10.1037/0033-295X.84.3.231.
Kleinberg J, Ludwig J, Mullainathan S, Sunstein CR. Discrimination in the age of algorithms. J Legal Anal. 2018;10:113–74. https://doi.org/10.1093/jla/laz001.
Acknowledgements
Not applicable.
Funding
No specific funding was acquired.
Author information
Authors and Affiliations
Contributions
I. P. Peringa: Conceptualization, Data Curation (Current data), Formal Analysis (Current data), Investigation, Methodology, Project Administration, Visualization (incl. faces), Writing—Original Draft, Writing—Review and Editing; E. G. M. Cox: Conceptualization, Data Curation (SICS data), Formal Analysis (SICS data), Methodology, Visualization (incl. faces), Writing—Review and Editing; R. Wiersema: Conceptualization, Curation (SICS data), Formal Analysis (SICS data), Visualization (incl. faces), Writing—Review and Editing; I.C.C. van der Horst: Conceptualization, Curation (SICS data), Formal Analysis (SICS data), Supervision, Visualization (incl. faces), Writing—Review and Editing; R.R. Meijer: Conceptualization, Data Curation (Current data), Formal Analysis (Current data), Investigation, Methodology, Project Administration, Supervision, Visualization (incl. faces), Writing—Original Draft, Writing—Review and Editing; J. Koeze: Conceptualization, Curation (SICS data), Formal Analysis (SICS data), Investigation, Methodology, Project Administration, Supervision, Visualization (incl. faces), Writing—Review and Editing
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The pilot study adhered to the ethical guidelines of the University of Groningen and was exempt from an ethics review following a low-risk self-assessment, based on criteria established by the Ethics Committee of the Faculty of Behavioral and Social Sciences (PSY-2324-S-0341). Additionally, the study utilized data from the SICS II study, which received approval from the local institutional review board (Medical Ethics Review Committee, University Medical Center Groningen; M18.228393, 2018/203).
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Peringa, I.P., Cox, E.G.M., Wiersema, R. et al. Human judgment error in the intensive care unit: a perspective on bias and noise. Crit Care 29, 86 (2025). https://doi.org/10.1186/s13054-025-05315-9