Is interrater correlation really a proper measurement of reliability?


Topic: Measurement, Research Methodology, Statistics
Publication: Human Performance
Article: Exploring the relationship between interrater correlations and validity of peer ratings
Blogger: Rob Stilson

Interrater reliability (still with me? OK, good) is often used as the main reliability estimate when correcting validity coefficients for a job-performance criterion. Issues arise with this practice when one considers that the errors between raters may not be random but due to bias, while agreement between raters may also stem from shared bias rather than true consistency. In this study, the authors’ main goals were to explore the relationship between interrater correlations and validity, and the relationship between the number of raters and validity.

To do this, the authors gathered peer ratings from 3,072 Israeli policemen in 281 work teams. Team size averaged about 12 people and ranged from 5 to 33. The measure used was overall performance on a 7-point Likert scale. The reliability index employed as the predictor in this study was ICC(C,k), which is equivalent to Cronbach’s alpha. Measurement indices were computed at the team level, since rating took place only within work teams.
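To make the ICC(C,k)/alpha equivalence concrete, here is a minimal sketch of Cronbach’s alpha computed on a ratees-by-raters matrix, with the raters treated as the “items.” The team data below are hypothetical, not from the study.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a ratees-by-raters matrix.

    Treating the k raters as items, this equals ICC(C,k): the
    consistency of the k raters' averaged ratings.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                         # number of raters
    rater_vars = ratings.var(axis=0, ddof=1)     # variance of each rater's column
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of summed ratings
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Hypothetical 7-point peer ratings: 4 ratees scored by 3 raters.
team = [[5, 6, 5],
        [3, 3, 4],
        [6, 7, 6],
        [2, 3, 2]]
print(round(cronbach_alpha(team), 3))  # → 0.975
```

Because the raters here rank the ratees almost identically, alpha comes out very high; disagreement between columns would pull it down.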

The predicted variable for the study was the validity coefficient for each work team. This is the part of the study where you could really feel the sweat involved: the authors gathered supervisor evaluations, absenteeism data, and discipline data collected over several years (for over 3,000 policemen)! They then converted this information into z-scores, with higher scores indicating better performance.
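The z-score step amounts to standardizing each criterion and flipping the sign where the raw scale runs the wrong way (more absences = worse performance). A small sketch with made-up numbers, not the study’s data:

```python
import numpy as np

def to_z(scores, higher_is_better=True):
    """Standardize scores; flip the sign when higher raw values mean worse performance."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return z if higher_is_better else -z

# Hypothetical criterion data for five officers.
supervisor_eval = to_z([6, 4, 5, 3, 7])                          # higher = better
absenteeism = to_z([2, 9, 4, 10, 1], higher_is_better=False)     # more days = worse
composite = (supervisor_eval + absenteeism) / 2                  # higher = better overall
```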

Results showed a weak positive linear relationship between interrater correlations and the various validity indexes. This is not what you want to hear if you are doing peer-rated performance evaluations. The authors posit that the correlation between raters is a conglomeration of factors having different theoretical relationships with validity (e.g., bias and other idiosyncrasies).

Practical implications concern the correction of validity coefficients for attenuation: if the reliability estimate used in the correction contains non-random error, the resulting corrected coefficients will be off. A positive finding for the work world was that validity in small units (fewer than 10 people) was about the same as in larger units. The authors believe this finding may be due to observation opportunity, which is seemingly greater in smaller work units.
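The attenuation correction at issue is the classical disattenuation formula, which divides the observed validity by the square root of the criterion reliability. The numbers below are illustrative, not taken from the article; the point is that an inflated or deflated reliability estimate distorts the corrected value.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_yy):
    """Classical correction of an observed validity coefficient r_xy
    for unreliability in the criterion: r_xy / sqrt(r_yy)."""
    return r_xy / sqrt(r_yy)

# Hypothetical: observed validity .30, interrater reliability .52.
print(round(correct_for_attenuation(0.30, 0.52), 3))  # → 0.416
```

If the .52 reflects shared rater bias rather than true consistency, the “corrected” .416 is not a better estimate of validity, which is the paper’s central worry.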

Kasten, R., & Nevo, B. (2008). Exploring the relationship between interrater correlations and validity of peer ratings. Human Performance, 21(2), 180-197.