7% was error variance ($\sigma^2_{\text{error}}$), which includes the variance due to rater unreliability. The intraclass correlation coefficient for absolute agreement between raters, $\mathrm{ICC}_{a,1}$, was 0.943, which indicates excellent interrater reliability. Of the total variance, 48.1% was attributable to score differences between residents and 45.5% to score differences between consultations; the latter component represents genuine resident-by-consultation interaction variance. The inconsistency coefficient for all consultation combinations was 0.482. The correlation between the average score of the first and second consultations and the inconsistency score ($R_{0,\text{inconsist}}$) was almost zero (−0.044) for all consultation combinations. The mean of score differences between the first and
second consultations, indicated by $\mu_{\text{dif}}$, did not differ significantly between the similar and dissimilar consultation combinations (0.030 and −0.533, t = 1.31, df = 48, p ≥ .05). However, the distributions of inconsistency scores differed significantly between the similar and dissimilar consultation combinations (Mann–Whitney U test, p < .05). The variance components also differed significantly between the similar and dissimilar consultation combinations. In the similar consultation combinations, the major proportion of the variance (65.1%) was linked to differences between residents ($\sigma^2_{\text{residents}}$), while in the dissimilar consultation combinations, the major proportion of the variance (67.5%) was linked to differences in residents’ performance between consultations ($\sigma^2_{\text{resid}\times\text{consult}}$). Thus, the inconsistency coefficients ($R^2_{\text{inconsist}}$) of the similar and dissimilar consultation combinations were also different (F = 16.41, p < .01). The Spearman correlation coefficient between the average score of the first and second consultations and the inconsistency scores ($R_{0,\text{inconsist}}$) was significant for the dissimilar consultation combinations (−0.538), but not for the similar consultation combinations (0.111). CST background had a significant effect on the average scores of all consultation combinations (Table 3, $\eta^2 = 0.243$, F = 7.53,
p < .01). However, the CST background effect was present only in the BBN consultations ($\eta^2 = 0.433$, F = 9.93, p < .01) and in the PMD consultations ($\eta^2 = 0.209$, F = 3.83, p < .05). CST background had no effect on performance in the other consultations and no effect on the inconsistency scores in any of the consultation combinations (Mann–Whitney U tests).

Reliability and generalizability studies treat performance inconsistency between consultations as measurement error. However, physicians are expected to communicate equally well in all consultations. Adequate communication in some consultations but mediocre or inadequate communication in others is unacceptable. In this study, we therefore explored the inconsistency of residents’ communication performance in challenging consultations.