
Table 2 Consistency scores for all models. Consistency reflects the percentage (%) of questions for which a model provided the same answer in at least 8 out of 10 repetitions, from a random subset of 100 questions. Consistently correct and consistently incorrect scores indicate the proportion of these responses that were accurate or erroneous, respectively

From: Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study

Model                 Consistency (%)   Consistently correct (%)   Consistently incorrect (%)
GPT-4o                96.0              88.5                       11.5
GPT-4o-mini           93.0              76.3                       23.7
GPT-3.5-turbo         74.0              67.6                       32.4
Llama 3.1 70B         92.0              81.5                       18.5
Mistral Large 2407    100.0             81.0                       19.0
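The metric described in the caption can be sketched in code. The snippet below is a minimal illustration, assuming each question's repetitions are stored as a list of 10 answer strings and the correct answers are known; the function name and data are hypothetical, not from the study itself.

```python
from collections import Counter

def consistency_scores(answers, correct, threshold=8):
    """Compute the three metrics from the table caption.

    answers: one list of 10 repeated answers per question
    correct: the correct answer per question
    Returns (consistency %, consistently correct %, consistently incorrect %).
    """
    # For each question, find the majority answer; the question is
    # "consistent" if that answer appears in at least `threshold` of 10 runs.
    consistent = []
    for reps, truth in zip(answers, correct):
        answer, count = Counter(reps).most_common(1)[0]
        if count >= threshold:
            consistent.append(answer == truth)
    if not consistent:
        return 0.0, 0.0, 0.0
    pct = 100 * len(consistent) / len(answers)
    correct_pct = 100 * sum(consistent) / len(consistent)
    return pct, correct_pct, 100 - correct_pct

# Hypothetical example: 3 questions, 10 repetitions each
answers = [
    ["A"] * 10,          # consistent and correct
    ["B"] * 9 + ["C"],   # consistent but incorrect
    ["A", "B"] * 5,      # inconsistent (5/10 split)
]
correct = ["A", "A", "A"]
print(consistency_scores(answers, correct))
```

Note that the consistently correct and incorrect percentages are computed only over the consistent questions, which is why each row of the table sums to 100%.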