
Table 2 Consistency scores for all models. Consistency reflects the percentage (%) of questions for which a model provided the same answer in at least 8 out of 10 repetitions, from a random subset of 100 questions. Consistently correct and consistently incorrect scores indicate the proportion of these responses that were accurate or erroneous, respectively

From: Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study

Model                 Consistency (%)   Consistently correct (%)   Consistently incorrect (%)
GPT-4o                96.0              88.5                       11.5
GPT-4o-mini           93.0              76.3                       23.7
GPT-3.5-turbo         74.0              67.6                       32.4
Llama 3.1 70B         92.0              81.5                       18.5
Mistral Large 2407    100.0             81.0                       19.0
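The metric described in the caption can be sketched in code. The snippet below is a minimal illustration, assuming each question's repetitions are stored as a list of 10 answer strings and the correct answers are known; the function name and data are hypothetical, not from the study itself.

```python
from collections import Counter

def consistency_scores(answers, correct, threshold=8):
    """Compute the three metrics from the table caption.

    answers: one list of 10 repeated answers per question
    correct: the correct answer per question
    Returns (consistency %, consistently correct %, consistently incorrect %).
    """
    # For each question, find the majority answer; the question is
    # "consistent" if that answer appears in at least `threshold` of 10 runs.
    consistent = []
    for reps, truth in zip(answers, correct):
        answer, count = Counter(reps).most_common(1)[0]
        if count >= threshold:
            consistent.append(answer == truth)
    if not consistent:
        return 0.0, 0.0, 0.0
    pct = 100 * len(consistent) / len(answers)
    correct_pct = 100 * sum(consistent) / len(consistent)
    return pct, correct_pct, 100 - correct_pct

# Hypothetical example: 3 questions, 10 repetitions each
answers = [
    ["A"] * 10,          # consistent and correct
    ["B"] * 9 + ["C"],   # consistent but incorrect
    ["A", "B"] * 5,      # inconsistent (5/10 split)
]
correct = ["A", "A", "A"]
print(consistency_scores(answers, correct))
```

Note that the consistently correct and incorrect percentages are computed only over the consistent questions, which is why each row of the table sums to 100%.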