In language testing, rater studies tend to examine rating performance at a single time point. Research on the effect of rater training either adopts a pre- and post-training design or compares the rating performance of novice and experienced raters. While those studies provide insights into the end results of rater training, little is known about how rater performance changes during the training process itself. This study examined how rater performance developed during a semester-long rater training and certification program for a post-admission English as a second language (ESL) writing placement test at a large US university.
The certification program aims to align raters to a newly developed rating scale that provides both placement recommendations and diagnostic information regarding students’ writing proficiency level and skill profile. The training process employed an iterative, three-stage approach, consisting of face-to-face group meetings, individual rating exercises, and scale recalibration based on rater performance and feedback. Using many-facet Rasch modeling (Linacre, 1989, 2006), we analyzed the rating quality of 17 novice raters across four rounds of rating exercises. Rating quality was operationalized in terms of rater severity and consistency, rater consensus, and raters’ use of the rating scale. These measurement estimates of rater reliability were compared across time and between certified and uncertified raters.
At the start of the training program, all raters were inconsistent, varied widely in severity, and achieved low exact score agreement. Over time, certified raters improved on multiple indices of rating quality and became increasingly indistinguishable from one another in their application of the rating scale. However, rater performance did not improve in a linear fashion but instead followed a U-shaped developmental pattern. In contrast, uncertified raters’ performance remained inconsistent across rounds. The findings of this study have implications for the effectiveness of rater training and for developmental patterns of rating behavior over time.