Note: A discussion of more general lessons learned while applying this assessment system is posted here. This entry is a dry, technical discussion of scoring and grade calculations, of interest only to teachers thinking of applying this system themselves.
Dan Meyer assesses students at least twice on every test item, scoring out of 4 each time. At the second round of assessment, he alters the possible points from 4 to 5. If a student scores a 4 on both rounds of assessment, she nets a 5. Otherwise, her highest score applies. Dan makes the second round of assessments a little harder than the first, so that a second 4 indicates greater skill than the first 4.
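The rule just described can be sketched as a small function. This is only my reading of it - the function name and the (score, possible) return pair are my own, and I'm assuming whole-number scores out of 4:

```python
def dans_score(first, second):
    """Dan Meyer's two-round rule as described above: both rounds are
    scored out of 4, the possible points become 5 at the second round,
    a 4 on both rounds nets a 5, and otherwise the highest score counts."""
    if first == 4 and second == 4:
        return (5, 5)  # perfect mastery twice nets a 5 out of 5
    return (max(first, second), 5)  # otherwise the highest score, now out of 5
```

So a student scoring 3 both times ends up at 3/5 = 60%, which is exactly the drop that caused me trouble.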
Altering the possible points from 4 to 5 means that students who do not actually improve their performance from one assessment to the next automatically see their grade drop. For example, if a student scores a 3 on the first assessment, and then another 3 on the second, more demanding assessment of the same skill, the grade on that particular concept drops from 3/4 = 75% to 3/5 = 60% - from a solid C to a D-. This caused problems that almost made me abandon the system early last fall. For one, students got upset when their grade dropped without their knowledge having changed. Secondly, having grades drop after a progress report has been issued is not actually legal - or so I was told by my Department Head. So one evening shortly after the first progress reports had been printed, we went through all the scores in the gradebook by hand, altering them so that the grades would come out the same as before - for example by changing 3/5 to 3.75/5, since 3.75/5 equals 3/4 = 75%.
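The manual fix amounts to rescaling each score so the percentage is preserved when the possible points change from 4 to 5. As a quick sketch (the function name is mine):

```python
def rescale_to_fifths(score_out_of_4):
    # preserve the percentage: score/4 == new_score/5
    return score_out_of_4 / 4 * 5

# e.g. a 3 out of 4 (75%) becomes 3.75 out of 5 (still 75%)
```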
Another problem with this system was that a scale from 0 to 4 seemed fairly coarse-grained. Students who made a mistake significant enough not to merit a top score on the first assessment would be marked down by 25 percentage points, and if they did not improve markedly by the second assessment they would net a D-. Improvement from that D- would be possible only by subsequently scoring a perfect 4. I first thought that the large number of skills and the repetition of assessments would lead to adequate continuity on the total grading scale - that students might average a C by scoring perfectly on some skills and poorly on others. However, some students seemed unable to ever score a 4, even when working hard. They'd always make some significant mistake or other, though never enough to make a D- seem appropriate. Now I am sure that in the mutual adjustment of quiz difficulty and scoring practice there is some wiggle room for making this work in a fair way, and I assume Dan Meyer has figured out a balance here. However, I ended up changing my grading scale.
Solving these problems without losing important features of the original system proved pretty difficult, however, and I found no perfect solution. I wanted my score assignment to do what Dan's did - in particular, to make it necessary for students to take every assessment twice, in order to ensure stability and retention of knowledge. Dan's practice of increasing the possible points does just that: students cannot simply be satisfied with their 3/4 = 75% and decide not to attempt the second assessment of the same skill. In the end I decided not to report students' scores online until they had had both assessments. I made the two assessments of equivalent difficulty (which simplified things for me), and then grades were assigned based on students' best two scores according to the following table:
In summary: If both scores are 3 or lower, the higher score applies. If both scores are above 3, the grade is the average of the two. If one score is above 3 and the other is not, the grade is the average of 3 and the higher score. With this score assignment, students still had an incentive to demonstrate perfect mastery twice, in order to net a grade of 100.
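The conversion just summarized can be written out as follows. This is a sketch under my reading of the table - in particular, I'm treating a score of exactly 3 as falling under the "higher score applies" rule:

```python
def composite_score(a, b):
    """Combine a student's best two scores (each out of 4) into one grade."""
    hi, lo = max(a, b), min(a, b)
    if hi <= 3:
        return hi            # both 3 or lower: the higher score applies
    if lo > 3:
        return (a + b) / 2   # both above 3: average of the two
    return (3 + hi) / 2      # mixed: average of 3 and the higher score
```

Two 4's then average to 4/4 = 100%, which preserves the incentive to demonstrate perfect mastery twice.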
A disadvantage of this system is its clunkiness compared to Dan's simpler one. Much of the appeal of this whole approach to grading was its transparency to students, the clarity it could afford them about what to focus on. Some of this is lost with the conversion table. Also, since the best two scores count, the system appears to have somewhat more inertia; poor scores don't go away as fast as they seem to in the original system, where the better score always counts. This slower improvement is more appearance than reality, since two 4's are necessary to achieve a 100 in Dan's system too, but appearance matters in this context. The main disadvantage, however, was switching to this different scale after the first progress report, which caused some confusion and, I think, some loss of buy-in from students. They seemed a little less enthusiastic about completing their tracking sheets after that.
As an alternative, I experimented a little with just entering both of the best two scores into PowerGrade this spring, labeling the entries "Skill 14A" and "Skill 14B," for example, and assigning half weight to each. I am undecided about whether I will do this in the fall or just enter the composite grade. It is of paramount importance that the students understand the relation between the scores on the papers they get back and the scores on their grade printout, and this system would help in that regard, but it would make for a large number of gradebook entries, which means more messiness.
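Assuming the gradebook computes a plain weighted average (I haven't verified PowerGrade's internals), two half-weight entries report the plain average of the two entries' percentages. A sketch, with hypothetical entry values:

```python
def reported_percent(entries):
    """Weighted gradebook average over (score, possible, weight) entries."""
    earned = sum(score / possible * weight for score, possible, weight in entries)
    return earned / sum(weight for _, _, weight in entries) * 100

# "Skill 14A" = 3/4 and "Skill 14B" = 4/4, each at half weight:
# reported_percent([(3, 4, 0.5), (4, 4, 0.5)])
```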
Finally, a note on the scoring of any quiz item: In some cases it made sense to assign a point value to different components of the test item, and sometimes I wrote the test items to make this possible. Other times, I evaluated the complete response to the test item as a whole, and assigned scores as follows:
Frankly, for some skills that did not lend themselves well to decomposition into parts with point values for each, I'd score based on my mental image of what a D-, a B-, and an A would look like. If grades are supposed to be derived from scores rather than the other way around, that introduces some circularity that one might argue about, but I don't care. I think grades make more sense as descriptors of performance levels than as translations of some numerical score anyway. But that is another story, best left for a separate discussion.
And since this scoring business turned out so much trickier than I'd anticipated, well-thought-out suggestions for making it clearer and fairer would be appreciated.