Saturday, June 14, 2008

Applying Dan's assessment system, Part II - scoring

Note: A discussion of more general lessons learned while applying this assessment system is posted here. This entry is a dry, technical discussion of scoring and grade calculations, of interest only to teachers thinking of applying this system themselves.

Dan Meyer assesses students at least twice on every test item, scoring out of 4 each time. At the second round of assessment, he alters the possible points from 4 to 5. If a student scores a 4 on both rounds of assessment, she nets a 5. Otherwise, her highest score applies. Dan makes the second round of assessments a little harder than the first, so that a second 4 indicates greater skill than the first 4.
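
To make the arithmetic concrete, here is a minimal sketch in Python of Dan's rule as I understand it from his description; the function name and the choice to return a percentage are mine.

    def dan_grade(first, second):
        """Grade for one skill after two rounds, under Dan's rule as I read it.

        Both attempts are scored out of 4 on paper; after the second round
        the denominator becomes 5. Two 4's net a perfect 5; otherwise the
        higher of the two scores applies.
        """
        points = 5 if (first == 4 and second == 4) else max(first, second)
        return points / 5 * 100  # percentage after the second assessment

    # Examples: dan_grade(4, 4) -> 100.0, dan_grade(3, 4) -> 80.0,
    # dan_grade(3, 3) -> 60.0 (the drop discussed below).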

Altering the possible points from 4 to 5 means that a student who does not actually improve her performance from one assessment to the next automatically sees her grade drop. For example, if a student scores a 3 on the first assessment and then another 3 on the second, more demanding assessment of the same skill, the grade on that particular concept drops from 3/4 = 75% to 3/5 = 60% - from a solid C to a D-. This caused problems that almost made me abandon the system early last fall. For one thing, students got upset when their grade dropped without their knowledge having changed. For another, having grades drop after a progress report has been issued is not actually legal - or so I was told by my Department Head. An evening shortly after the first progress reports had been printed found us manually going through all the scores in the gradebook, altering them so that the grades would be the same as before, for example by changing 3/5 to 3.75/5, since that is equal to 3/4 = 75%.
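
For the record, that manual fix amounts to rescaling each old score out of 4 to an equivalent score out of 5, i.e. multiplying by 5/4. A one-line sketch (the function name is mine):

    def rescale(old_score, old_max=4, new_max=5):
        """Convert a score out of old_max to an equivalent score out of new_max,
        keeping the percentage unchanged: 3/4 = 75% becomes 3.75/5 = 75%."""
        return old_score * new_max / old_max

    # rescale(3) -> 3.75, so the gradebook shows 3.75/5, still 75%.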

Another problem with this system was that a scale from 0 to 4 seemed fairly coarse-grained. Students who made a mistake significant enough not to merit a top score on the first assessment would be marked down by 25 percentage points, and if they did not improve markedly by the second assessment they would net a D-. Improvement from that D- was possible only by subsequently earning a perfect score. I first thought that the large number of skills and the repetition of assessments would lead to adequate continuity in the total grading scale, that students might average a C by scoring perfectly on some skills and poorly on others. However, some students seemed, even when working hard, unable to ever score a 4. They would always make some significant mistake or other, but not one serious enough to make a D- seem appropriate. Now I am sure that in the mutual adjustment of quiz difficulty and scoring practice there is some wiggle room for making this work in a fair way, and I assume Dan Meyer has figured out a balance here. However, I ended up changing my grading scale.

Solving these problems without losing important features of the original system proved pretty difficult, however, and I found no perfect solution. I wanted my score assignment to do what Dan's did - in particular, to make it necessary for students to take every assessment twice, in order to ensure stability and retention of knowledge. Dan's practice of increasing the possible points does just that: students cannot simply be satisfied with their 3/4 = 75% and decide not to attempt the second assessment of the same skill. In the end I decided not to report students' scores online until they had had both assessments. I made the two assessments of equivalent difficulty (which simplified things for me), and grades were then assigned based on students' best two scores according to the following table:

In summary: If both scores are 3 or lower, the higher score applies. If both scores are above 3, the grade is the average of the two. If one score is above 3 and the other below, the grade is the average of 3 and the higher score. With this score assignment, students still had an incentive to demonstrate perfect mastery twice, in order to net a grade of 100.
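
In code, the rule I settled on looks roughly like the sketch below. The conversion of the resulting 0-4 composite into a percentage is what the table above handles, and the treatment of a score of exactly 3 paired with a higher score is my own reading of the rule.

    def composite(first, second):
        """Composite 0-4 score from a student's best two attempts.

        - Both scores 3 or lower: the higher one applies.
        - Both scores above 3: the average of the two.
        - One score above 3, the other at or below 3: the average of 3
          and the higher score (my reading of the edge case at exactly 3).
        """
        lo, hi = sorted((first, second))
        if hi <= 3:
            return hi
        if lo > 3:
            return (lo + hi) / 2
        return (3 + hi) / 2

    # Examples: composite(3, 3) -> 3, composite(2, 4) -> 3.5,
    # composite(3.5, 4) -> 3.75, composite(4, 4) -> 4 (a 100).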

A disadvantage of this system is its clunkiness compared to Dan's simpler one. Much of the appeal of this whole approach to grading was its transparency to students, the clarity it could afford them about what to focus on. Some of this is lost with the conversion table. Also, since the best two scores count, the system appears to have somewhat more inertia; poor scores don't go away as fast as they seem to in the original system, where the better score always counts. This slower improvement is more appearance than reality, since two 4's are necessary to achieve a 100 in Dan's system too, but appearance matters in this context. The main disadvantage, however, was switching to a different scale after the first progress report, which caused some confusion and, I think, some loss of buy-in from students. They seemed a little less enthusiastic about completing their tracking sheets after that.

As an alternative, I experimented a little with just entering both of the best two scores into PowerGrade this spring, labeling the entries "Skill 14A" and "Skill 14B," for example, and assigning half weight to each. I am undecided about whether I will do this in the fall or just enter the composite grade. It is of paramount importance that the students understand the relation between the scores on the papers they get back and the scores on their grade printout, and this system would help in that regard, but it would make for a large number of gradebook entries, which means more messiness.

Finally, a note on the scoring of any quiz item: In some cases it made sense to assign a point value to different components of the test item, and sometimes I wrote the test items to make this possible. Other times, I evaluated the complete response to the test item as a whole, and assigned scores as follows:


Frankly, for some skills that did not lend themselves well to decomposition into parts with point values for each, I'd score based on my mental image of what a D-, a B-, and an A would look like. If grades are supposed to be derived from scores rather than the other way around, that introduces some circularity that one might argue about, but I don't care. I think grades make more sense as descriptors of performance levels than as translations of some numerical score anyway. But that is a topic for a separate discussion.

And since this scoring business turned out to be so much trickier than I'd anticipated, well-thought-out suggestions for making it clearer and fairer would be appreciated.

Friday, June 13, 2008

Applying Dan's assessment system, Part I

Dan Meyer breaks his courses into some 35 discrete skills and concepts, keeps separate records on students' performance on each skill, and keeps retesting students and counting their highest scores. The following two entries are some notes on things I learned while applying an adapted version of his system to my Algebra 1 and Intermediate Algebra classes this year. The second entry is a dryly technical discussion of scoring.

In accordance with my Department Head's recommendation, I did not entirely replace traditional comprehensive tests with this more piecemeal system. For Algebra 1, the concept quizzes were weighted at 40% of students' grades, while comprehensive tests made up the remaining 30% of the assessment portion of the grade. For Intermediate Algebra I weighted the two types of assessment at 35% each. My experiences were that...
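
In gradebook terms, the overall grade is then just a weighted average of the category percentages. A small sketch with the Algebra 1 weights; note that the remaining 30% going to non-assessment categories such as homework is my assumption, not something stated above.

    def overall_grade(category_averages, weights):
        """Weighted average of category percentages; weights should sum to 1.0."""
        return sum(category_averages[cat] * w for cat, w in weights.items())

    # Algebra 1 weighting from the text; the "other" 30% (homework etc.)
    # is assumed for the sake of the example.
    weights = {"concept quizzes": 0.40, "chapter tests": 0.30, "other": 0.30}
    averages = {"concept quizzes": 80, "chapter tests": 75, "other": 90}
    print(overall_grade(averages, weights))  # 81.5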

...this system worked significantly better for Algebra 1 than for Intermediate Algebra.

In Algebra 1, I felt that pretty much everything the students really needed to know was covered by the concept quizzes - I might as well not have done chapter tests at all. For Intermediate Algebra, however, the skills tended to become cumbersomely complex or impossibly numerous, and the supplemental chapter tests were necessary and useful.

One reason is that Intermediate Algebra, which is essentially the first 70-80% of Algebra 2, covers much more.1 Another reason is that synthesis and solution of multi-step problems are inherent, irreducible goals of Algebra 2, and these skills need to be assessed, too.2

... for diagnosing and remedying deficiencies in basic skills, this system was beautiful.

At some point early in the semester I realized that a number of incoming Algebra 1 students did not know the concept of place value and could place neither decimal numbers nor fractions on a number line. Writing an assessment on placing decimals on the number line made it possible to identify who was having trouble with this, and to know when a critical mass of students had caught up in this area. As a tool for probing missing background skills and for placing them clearly and definitely on the agenda, this was powerful.

... writing effective assessment items was harder than I thought.

When an assessment may potentially be repeated two, three, even five or six times, what it measures had better be really important, and the assessment had better actually capture the intended skill. It is not as easy as it may sound to decide which elements of the course really are that important - which parts the rest of the understanding hinges on. My list of concepts to be assessed always tended to get too long, and trimming it down to the real essentials was a constant challenge. As for designing valid measurements of students' skills, I guess only experience makes it possible to figure out what kinds of problems will really show that they know what they need to know, what kinds of problems plough just deep enough without getting too involved, and what kinds of misunderstandings are typical and must be caught in order to make necessary remediation possible.3

... assessments are not enough. Improvement is not automatic.

That's obvious, of course. How silly to think otherwise. Frankly, part of what I found attractive about this assessment system was the idea that with goals broken down into such small, discrete pieces, students would become empowered and motivated and take the initiative to learn what they needed to make the grade. That was actually to a significant extent the case. Tutoring hours were far more efficient due to the existence of these data, and students knew what to do to "raise their grade." However, a lot of students continued to score poorly, repeating the same mistakes, after three, four, five rounds of assessment on the same topic. Some would come during tutoring hours to retake a quiz and still make exactly the same mistakes... For weaker students especially, then, it is important to remember that the assessment data are tools for me to actually use. There is no automaticity in the translation of this very specific feedback into actual understanding.

... the transparency of the system means bad things are out there for everyone to see.

That's what we want, isn't it? The direct and honest reporting involved was a major appeal of this system. However, it takes some foresight for this not to lead to discouragement. While it is pretty common practice among math teachers - any teachers - to rescale test scores so that the class average turns out okay, this could not be done in any simple way with these concept-wise assessments. The only way to improve class grades was by reteaching the material and testing again. This involved a time delay during which the grades, which were published in an online gradebook, could be quite low. This was especially true during the first month or two of school, when the grades were based on relatively few entries, and - well - the first months of school may not be the time you want parents worrying about what you're doing when you're a new employee. In the early stages I ended up scaling chapter tests a good deal in order to compensate for some low concept quiz scores and make the overall grades acceptable. With time, a combination of rewriting certain concept quizzes that were needlessly tricky and teaching some topics better made this less necessary.4

In conclusion, I am definitely keeping this system for Algebra 1, probably increasing the weighting of these assessments and reducing the number and importance of comprehensive tests. For Intermediate Algebra I am keeping chapter tests, and writing a new set of piecemeal assessments to cover just the basics, so that I can have hard data on who is really lost, without even trying to force these assessments to cover the entire curriculum. I'll need to make sure that the first skills are very well taught and mastered before the first round of assessments: thinking a little strategically to make sure the early results are good increases buy-in, and student ownership is, after all, much of the point here.


Notes

1 By way of example, compare the content of the chapters on exponents in the two courses. To assess mastery of this chapter for Algebra 1, I needed to check that students knew the definition of a natural power as repeated multiplication, that they could apply the power rules to simplify expressions, that they could deal with negative and zero powers, and that they could complete a table of values of a simple exponential function such as 2^x and plot the points to sketch a simple exponential graph. For the chapter on exponential and logarithmic functions for Intermediate Algebra, however, I needed to check whether students could do all of the above, plus convert between exponential and logarithmic form, apply the properties of logarithms, solve exponential and logarithmic equations by applying one-to-one properties, solve such equations by applying inverse properties, apply the change-of-base formula, apply the compound interest formula, identify transformations of the exponential function, and understand that exponential and logarithmic functions are inverses of each other, plus a few other things that I just skipped. The number of chapters to be covered is pretty much the same for both courses, but the number of concepts and skills? A different story. Writing broader concept tests for more advanced courses is a possibility, but the advantages of this piecewise assessment system over the usual comprehensive test system are quickly lost that way.

2 For an example of how some core skills of Intermediate Algebra are by nature multi-step and integrative, consider solving a third-degree polynomial equation by first finding one root from the graph, then dividing by the corresponding linear factor, and then applying the quadratic formula to find the remaining roots. This task is too complex for a concept-wise assessment to be very useful. I had separate assessments on 1) identifying factors given the graph of a polynomial, 2) polynomial division and rewriting a polynomial using the results of the division process, 3) stating and applying the factor theorem, and 4) applying the quadratic formula. But I still wanted to check whether the students could put it all together.
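
To make the footnote's example concrete, here is a small worked sketch in Python with a polynomial of my own choosing, where the root x = 1 is taken as "read off the graph."

    import math

    # Worked example: p(x) = x^3 - 2x^2 - 5x + 6.
    # Step 1: a root, here x = 1, is read off the graph (p(1) = 0).

    def synthetic_division(coeffs, root):
        """Divide a polynomial (coefficients from highest degree down) by (x - root)."""
        quotient = [coeffs[0]]
        for c in coeffs[1:]:
            quotient.append(c + quotient[-1] * root)
        remainder = quotient.pop()  # 0 if root really is a root
        return quotient, remainder

    # Step 2: divide out the linear factor (x - 1).
    quotient, remainder = synthetic_division([1, -2, -5, 6], 1)
    # quotient = [1, -1, -6], i.e. x^2 - x - 6; remainder = 0

    # Step 3: apply the quadratic formula to the quotient for the other roots.
    a, b, c = quotient
    disc = math.sqrt(b * b - 4 * a * c)
    other_roots = ((-b + disc) / (2 * a), (-b - disc) / (2 * a))  # 3.0 and -2.0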

3 As for the assessment being valid, actually capturing the important skill, here's an example of a failed attempt: I wrote one concept quiz about identifying the difference between an equation and an expression - about distinguishing the cases where you solve for a variable from the cases where you can only simplify - but success on this assessment did not mean an end to confusing these two. Does that mean that the assessment was poorly written, or rather that this distinction just doesn't lend itself to being assessed once and for all in a little concept quiz? Are understanding equivalence, and distinguishing equations (statements that are true or false) from expressions (which just are), ideas too abstract to be covered this way? I don't know, but my impression is that the quiz did little to eradicate the common mistake of treating expressions as if they were equations, for example by adding or subtracting new terms in order to simplify.

4 This is at a private school, where determining the required level of mastery of each standard is to a larger extent up to the teacher, since no state testing is involved in defining the bar.