Criterion-Based Grading: The Alternative to Bad Traditional Grading Practices

Grading

Jun 15

Image depicts different forms of measurement, such as a ruler, a scale and a clock. — Image generated by ChatGPT. I think it can be helpful for teachers to think of assessment and grading in terms of data collection and measurement.

In my previous post, I critiqued the use of traditional grading practices and concluded that the biggest problem is simply that they’re normative grades, meaning that they’re not referenced to any predetermined criteria, but rather are merely comparisons to, and a means of sorting amongst, student peer groups. Based on traditional grading practices, an A grade in a class provides no information about what a student actually knows, understands or is able to do. It merely indicates that the student was better at being a student in that class than most of the other students in that class. This raises an additional problem with traditional grading that I intentionally left out of my previous post. I left it out because the problem is largely irrelevant within the context of normative grading. The problem is something that Tomlinson and Moon (2013), in their book Assessment and Student Success in a Differentiated Classroom, call “grade fog” (p. 132).

Let me unpack this problem of grade fog using the example I stated above of a student with a grade of A in a particular class. I stated that within traditional normative grading, that A indicates that this student was better “at being a student” than most of the other students in her class. That seems like an odd way of putting it. In my previous post I suggested that traditional grades are based on a paradigm that values how much a student remembers of the content that was covered in the class. In other words, the basis for grades is some loose sense of how much a student remembered of the covered content. That’s paradigmatically true, but in practice, teachers often combine all sorts of different things into a single grade. They provide grades for student behavior, work-ethic, participation in class, homework completion, neatness, attendance, promptness, skills in academic tasks, etc., on top of grades for assignments intended to measure knowledge of the course content. All of these grades then get averaged together for a final grade.

This is what Tomlinson and Moon mean by grade fog. The different expectations of what it means to be an A student are so convoluted – not to mention that they differ dramatically from one teacher to the next – that it’s completely unclear what that A actually represents. Does it mean that he worked really hard, was super dutiful, turned everything in on time, and always completed homework, even though he was often quite confused about the actual content of the course? Or does it mean that she has a really good memory for everything that is said in class, even though she’s kind of lazy and doesn’t put forth much effort? These are descriptions of two very different students, but they could potentially end up with the same grade. When grades are normative, this problem of grade fog doesn’t really matter because the point of the grade is not to reference any pre-set criteria anyway; the point of the grade is to compare and sort students. But, if you agree with me that this traditional normative grading is deeply flawed and even harmful to student learning, then as we discuss alternatives, we will have to also address this problem of grade fog. I’ll come back to this issue a little further down in this post.

I’ve now mentioned several times the problem of normative grades not being referenced to any predetermined set of criteria. This brings us to an alternative to normative grading, which is criterion-based grading. In short, criterion-based grading means that the grade reflects the extent to which the student achieved the criteria for that assignment, unit, or course. Other terms that could be substituted for “criteria” are “goals” or “objectives,” but the point is that the criteria are set from the outset and are transparently clear to both the teacher and the students. I’m a former track & field coach, so I feel compelled to provide a brief analogy from the track. If I lined eight student-athletes up in starting blocks for a 400m race, fired the gun, watched them run, ranked them at the finish line based on their finishing position, and then assigned them grades based on that ranking (1st place gets an A, 2nd & 3rd get Bs, 4th - 6th get Cs, etc.), that would be an example of normative grading. However, if instead I lined eight student-athletes up at the start line and declared a goal based on a time to beat – let’s say 1 min, 5 sec – and then fired the gun and watched them run, my grades would look completely different. Instead of standing at the finish line and recording their finishing position, I’d stand at the finish line with my stopwatch, recording each student’s finishing time. In this case, the students aren’t competing against each other; they’re competing against the clock. So what would my grades look like in this case? Well, they’d be some letter, number, symbol or set of words that represent whether or not each student met the goal.
For students who finished with a time less than 1 min, 5 sec, I’d have a grade that communicated: “met the goal.” For students who blitzed around the track, ran sub-55 sec, and completely crushed the goal, I could have a grade that communicated: “exceeded the goal.” For students who didn’t quite make the goal, but were close and probably could make the goal with a little more practice, I’d have a grade that communicated: “approaching the goal.” For students who were way off the goal, I’d have a grade that communicated: “did not meet the goal” or “just starting towards the goal.” Notice that when there are predetermined criteria against which the students are being measured, there is no consideration at all of a normal distribution curve. If all eight of the students run the lap in under 1 min, 5 sec, they all get the “met the goal” grade; I don’t need to sort the fastest from the slowest and assign grades based on rank.
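
For readers who think in code, the contrast in the track analogy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the runner names and times are invented, the "D" for 7th and 8th place fills in the post's "etc.," and the 5-second margin for "approaching" is a made-up cutoff, since the analogy doesn't specify one. Only the 1 min, 5 sec goal and the sub-55 sec mark come from the analogy itself.

```python
# Hypothetical finishing times in seconds (invented data for illustration).
TIMES = {
    "Ana": 54.2, "Ben": 58.9, "Cal": 61.0, "Dee": 63.5,
    "Eli": 64.8, "Fay": 66.9, "Gus": 69.5, "Hal": 78.0,
}

GOAL = 65.0               # from the analogy: 1 min, 5 sec
EXCEEDS = 55.0            # from the analogy: sub-55 sec
APPROACHING_MARGIN = 5.0  # assumption: within 5 sec of the goal


def normative_grades(times):
    """Grade by finishing position only: 1st = A, 2nd-3rd = B, 4th-6th = C."""
    ranked = sorted(times, key=times.get)  # fastest runner first
    grades = {}
    for place, name in enumerate(ranked, start=1):
        if place == 1:
            grades[name] = "A"
        elif place <= 3:
            grades[name] = "B"
        elif place <= 6:
            grades[name] = "C"
        else:
            grades[name] = "D"  # assumption: the "etc." continues with D
    return grades


def criterion_grade(seconds):
    """Grade one runner against the clock, ignoring every other runner."""
    if seconds < EXCEEDS:
        return "exceeded the goal"
    if seconds < GOAL:
        return "met the goal"
    if seconds < GOAL + APPROACHING_MARGIN:
        return "approaching the goal"
    return "did not meet the goal"


if __name__ == "__main__":
    ranks = normative_grades(TIMES)
    for name, seconds in TIMES.items():
        print(f"{name}: {seconds:.1f}s  rank-based={ranks[name]}  "
              f"criterion-based={criterion_grade(seconds)}")
```

Notice that `normative_grades` needs the whole field of runners to grade anyone, while `criterion_grade` needs only one runner's time. If every time in the data were under 65 seconds, the rank-based function would still hand out Cs and Ds, but the criterion-based one would report that everyone met the goal.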

I can extend this track analogy even further to show the importance of the coach (i.e., the teacher) when it comes to the goal. Instead of announcing the goal on the start line, let’s say the coach announces the goal a month prior to the race. Then, for the next month, the coach guides the student-athletes through regular, carefully designed practices, targeting fitness, endurance, speed work, and technique specific to a 400m race. At the end of the month of these practices, the coach lines the eight student-athletes up on the start line, reminds them of the goal, and then fires the gun. I suspect a good number of those athletes would meet the goal, because their performance, after a month of guided training, is not based on random, naturally occurring, independent variables (see previous post regarding normally distributed data sets). Rather, it’s been highly influenced by the coaching and the practice sessions, just as classroom learning should be impacted by the teacher and the lessons. (Note: as with any analogy, there are points at which it breaks down and doesn’t provide a perfect model for what I’m trying to describe; in the case of a 400m race on a track, admittedly, it would not be fair for every student-athlete to have the same goal.)

What are we talking about in terms of criteria when it comes to classroom learning? What are the goals or objectives for a class? Classroom learning criteria are often referred to as KUDs (because it’s education and educators love acronyms). KUDs refer to what students are expected to Know, Understand and be able to Do by the end of a segment of learning. Teachers typically refer to these segments of learning as “units.” Units can range widely in length and complexity, depending on the nature of the subject, preferences of teachers, and grade level. At the high school level, I’m rather inclined towards long 8-10 week units that extend over an entire academic quarter, but there’s definitely an argument to be made for breaking up and chunking learning into smaller units of 2-3 weeks each. Regardless of the length of a unit, the point is that, from the very outset of the unit, the teacher must be clear about exactly what the students are expected to know, understand and be able to do by the end of that unit. Once this is clear, assessments, individual lessons, instructional strategies, student practice exercises, etc. are all designed (just like the coach’s practice sessions) with these KUD goals in mind. Furthermore, the KUDs must also be transparent to the students. They must know the goals towards which they are working. To be fair, depending on the grade level of the students, the KUDs communicated to students may need to be intentionally re-written in student-friendly language, but the students need to know the criteria against which they will ultimately be measured.

When the criteria are established in advance and all instruction is geared towards those criteria, then assessments must be designed that measure whether or not the students have met the criteria. Wiggins and McTighe, the developers of the Understanding by Design (UbD) model for unit development, argue that the unit summative assessments should actually be developed at the outset of the unit, prior to any instructional planning, and prior to teaching any lessons (see Understanding by Design Guide to Creating High Quality Units, 2011). This is why their model is referred to as “backward design.” It starts with the learning criteria students will be expected to meet by the end of the unit, then develops the assessments that will measure whether or not students met those criteria, and then develops and implements lesson plans and instructional strategies that will guide students to success on those assessments. They argue that, by doing this, the teacher ensures that the summative measurements of the unit learning are directly aligned to the unit KUDs. Again, this is just like the coach who states the goal, determines the way of measuring achievement of the goal (the 400m race), and then goes about designing and leading practice sessions to guide the students towards success in that race.

An important question at this point will be: “Where do these unit criteria come from?” The answer here is that there are many different sources that will likely influence the criteria of a unit and it will depend on the teaching and learning context. Obviously the particular discipline of the class will be important in determining these criteria. There are unique facts, concepts, principles, procedures and skills to each discipline. There may also be school-level or district-level learning priorities that influence these criteria. These days in education circles, particularly in American education circles, curricular standards will definitely play a primary role in determining the criteria (state standards, the Common Core, etc.). In fact, sometimes criterion-based grading is referred to as “standards-based” grading. This requires a few clarifications. First, many curriculum experts, including Wiggins and McTighe, argue that the criteria of a unit should be articulated as KUDs (Wiggins and McTighe use the terms: Understandings, Knowledge and Skills, and also emphasize an overarching Transfer goal). What are the specific facts and concepts that students will be expected to know? What are the big ideas and principles that the students will be expected to understand? What are the skills that the students will be expected to be able to do? Most sets of standards are not written in this form, which means that they must be unpacked, broken-down, and expressed as KUDs. In other words, the standards themselves are not the unit criteria, though the unit criteria are derived from them. Second, some models of standards-based grading involve teachers providing a separate grade for each individual standard and reporting all of these separate grades to parents. While this is one model, as I will explain below, this is not the only model for criterion-based grading. 
In fact, it’s not the model that I prefer, largely because it can result in tedious and overwhelming amounts of grade information for students and parents and because curricular standards are not always written in ways that are meaningful to parents.

What exactly do the grades look like when doing criterion-based grading? There are different models, but my personal preference is a simple four-point scale that looks like this:

4 = student has met all or almost all of the criteria and in some cases may have exceeded the criteria.
3 = student has met most of the criteria.
2 = in most instances, the student is approaching the criteria.
1 = the student has not met the criteria.

Of course, using numbers for these grades is not actually necessary. They don’t get summed or averaged in any way, so one could just as easily drop the numbers and use only words:

Instead of 4, call it “Highly Proficient” or “Exceeds”
Instead of 3, call it “Proficient”
Instead of 2, call it “Approaching”
Instead of 1, call it “Does Not Meet” or “Novice”

If a school is particularly tied to letter grades – and there’s a reasonable argument for this given that parents and students may already be quite familiar and comfortable with letter grades – they can still be used. In this case, one can stretch things out to five different levels:

A = student has met all or almost all of the criteria and in some cases may have exceeded the criteria.
B = student has met most of the criteria.
C = in most instances, the student is approaching the criteria.
D = the student is just starting towards the criteria.
F = the student has not met the criteria.

There is also nothing magical about having four or five different levels, although Thomas Guskey (2015), a key expert on grading and grading reform in schools, argues in his book On Your Mark that beyond four or five levels of distinction, grading reliability breaks down. In other words, beyond four or five levels, it becomes increasingly difficult for teachers to distinguish consistently between them. If one had a 10-point scale, for example, what exactly would distinguish a 7-level piece of student work from a 6-level piece of student work? Any teacher who has ever written a scoring rubric for an assignment knows all too well how difficult it gets to meaningfully describe distinctions between performance levels. I think I’d go crazy if I were expected to write level descriptors for ten different levels of performance. If teachers can’t describe these levels of distinction, how can they be expected to identify them when they are scoring student work? I think that teachers who argue that they can are just deceiving themselves. I’ve participated in enough score-calibration sessions with other teachers to know how widely divergent a group of teachers can be when scoring the same piece of student work, even when there are only four or five levels of distinction, let alone more than that.

As an example, my previous school made commendable steps towards criterion-based grading during the time I was teaching there. Because the high school had the IB Diploma Program in grades 11 and 12, and the IBDP uses a 1-7 grading scale, the high school division of my school decided to use a 1-7 scale across grades 9-12. This made a lot of sense from the perspective of consistency throughout the high school, but even seven levels of distinction on a criterion-based grading scale were problematic. Grades of 4 and 5, for example, were both considered “Proficient” grades, but the distinction between a 4 and a 5 was pretty blurry. The same was true of grades of 6 and 7, which were both considered “Exemplary” grades. Even stranger, grades of 1 and 2 were both considered “Does Not Meet.” How exactly does one distinguish between two different levels of not meeting the criteria? And if you could, what’s the point?

At this point, I suspect someone who is reading is asking a very logical question. The above described grade levels may make sense when scoring single assessments, but how does a teacher compile a student’s set of grades over the course of a grading term into a final grade for a report card? If the teacher is using the numbers 1-4 for grades on different assessments, does he or she then just average them for a final grade that represents that grading term? I would caution against averaging. If there are only four points on the scale, then averaging is bound to result in a fractional number that then requires the teacher to decide to round down to the level below or round up to the level above. That’s a lot of teacher discretion on a four point scale, so why not just fully embrace teacher judgment in the final grade decision? In other words, the teacher has to look at the full corpus of a student’s grades for that grading term and decide which description best fits their level of achievement. Did they meet all or almost all of the criteria during that grading period? If so, that's a 4. Did they meet most of the criteria during that grading period? If so, that’s a 3. This allows the teacher to consider the relative weight of different assessments in terms of the number or complexity of criteria assessed. It also allows the teacher to consider the way in which assessments later in the grading period were more cumulative in their measurement of important criteria. It also allows a teacher to consider the grade trend of the student. If a student got off to a bad start at the beginning of the grading period, but then, by the end of the grading period, demonstrated proficiency on most of the criteria for the grading period, shouldn’t that be considered in the student’s overall corpus of grades? I will emphasize, however, that this stage of teacher judgment in determining a final grade must be justified by the assessment evidence. 
Furthermore, this justification requires that the criteria of units be clearly articulated from the outset, and it requires summative assessments that actually measure whether or not students met those criteria (this is referred to as assessment validity; did the assessment measure what it was intended to measure?).
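
To make the caution about averaging concrete, here is a small Python sketch. The two score histories are invented for illustration, and the helper names are hypothetical. It shows only why the average is a weak summary on a four-point scale. It is not a substitute for the evidence-based teacher judgment described above.

```python
from statistics import mean

# Invented score histories on the 1-4 scale, listed in chronological order.
slow_starter = [1, 2, 3, 4]   # struggles early, proficient by the end
fading_student = [4, 3, 2, 1]  # starts strong, declines over the term


def average_grade(scores):
    """Plain arithmetic mean of the scores, ignoring their order."""
    return mean(scores)


def needs_rounding(scores):
    """True when the average falls between two levels of the scale."""
    return average_grade(scores) % 1 != 0


if __name__ == "__main__":
    for label, scores in [("slow starter", slow_starter),
                          ("fading student", fading_student)]:
        print(f"{label}: scores={scores}  average={average_grade(scores)}  "
              f"needs rounding={needs_rounding(scores)}")
```

Both students average 2.5, which sits exactly between “Approaching” and “Proficient,” so the teacher has to make a judgment call either way. The average also cannot tell the two trajectories apart, even though the most recent evidence says very different things about what each student can do now.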

Before concluding this post, I now need to return to the problem of “grade fog” that I raised at the beginning. All that I’ve described so far about criterion-based grading does not necessarily eliminate the problem of grade fog. Teachers could still include all sorts of different criteria within a unit. In my discussion above about KUDs, I made it clear that unit criteria should be limited to what a student is expected to know, understand and be able to do by the end of the unit. But there are still so many other things that go on in classrooms that teachers have a habit of including in grades. What about showing up to class on time with the necessary materials? What about effective collaboration with classmates? What about participating in class? What about completing homework? What about submitting things on time? What about recognizing the progress that students make? Some students may not meet all the criteria, but may make huge amounts of progress during a unit. Shouldn’t that be recognized in a grade in some way? The problem is that if all of these things are also thrown into the consideration of a single final grade, the meaning of that grade is lost, and the idea of it being criterion-based is also lost.

Thomas Guskey (2020), in a follow-up book to the one mentioned above called Get Set, Go!, suggests that the answer is to report multiple final grades that serve different purposes (p. 117ff). He suggests that the things worth grading and reporting in the classroom can be placed into three categories. The first he calls the Product category. This is the term he uses for the criterion-based grading I’ve described above. The product grade will be the one that reflects whether or not students met the unit criteria of a grading period, which have been predetermined and articulated as KUDs. Note that Tomlinson and Moon also discuss three categories for grading, but they refer to this first one as the Performance grade, which is a term I tend to like better. Guskey’s second category is what he calls the Progress category. The progress grade will reflect the learning progress a student has made during that grading period, regardless of whether or not they met the criteria. Finally, Guskey suggests a Process category, by which he refers to the student behaviors, attitudes and competencies that facilitate learning. Together they make up the 3Ps of grading. All three of these grades should still be criterion-based, meaning that the criteria (i.e., goals or objectives) associated with each grade should be articulated clearly and transparently at the outset, so that teachers are clear about what they’re assessing and grading, and students know how they’re being assessed and graded.
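
One way to picture the 3Ps is as a report-card record that stores the three grades side by side and never merges them. The Python sketch below is a hypothetical illustration, not Guskey's own scheme. The class and field names are invented, and reusing the post's four-level word scale for the progress and process grades is an assumption made to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """The four-level word scale from earlier in the post."""
    DOES_NOT_MEET = 1
    APPROACHING = 2
    PROFICIENT = 3
    HIGHLY_PROFICIENT = 4

    @property
    def label(self):
        # "HIGHLY_PROFICIENT" -> "Highly Proficient"
        return self.name.replace("_", " ").title()


@dataclass(frozen=True)
class ReportCardEntry:
    """Three separate grades; deliberately no combined or averaged field."""
    student: str
    product: Level   # did the student meet the unit KUDs?
    progress: Level  # how far did the student move during the term?
    process: Level   # behaviors and habits that support learning

    def summary(self):
        return (f"{self.student}: Product = {self.product.label}; "
                f"Progress = {self.progress.label}; "
                f"Process = {self.process.label}")


if __name__ == "__main__":
    entry = ReportCardEntry("Sam", Level.APPROACHING,
                            Level.HIGHLY_PROFICIENT, Level.PROFICIENT)
    print(entry.summary())
```

Because the record has no overall field, a student like the invented “Sam” can be reported as still approaching the unit criteria while also being recognized for large gains and solid work habits, without any of those signals fogging the others.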

I will conclude this post by pointing out that what I’ve outlined above is not easy in practice, and as I seek to put it into better practice, I suspect I’ll have further refinements to share. My previous school had some of the above grading practices in place for a number of years, but I was not always effective at using them. This was in part because there were flaws in the school-wide setup itself, but it was in larger part because I didn’t fully understand criterion-based grading, and I think that was true of many teachers at the school. I was often still trying to jam parts of my old understanding of traditional grading into this new structure. Here are a few of the takeaways that I’ve learned over the past year; these are takeaways based on what I was not doing well previously, despite teaching in a school that had made efforts to reform its grading practice.

First and foremost, my problem was that I often didn’t have a clear articulation of the criteria for a unit. I have long developed units using the UbD format, so I knew all about backward design and the need to unpack the understandings, knowledge and skills of a unit, but I often failed to do this at the level of detail and precision that would allow me to actually measure those criteria on assessments. Related to this, I think I was also unclear about the role of standards in criterion-based grading, and often assumed that the standards themselves were the criteria. Because of the way that standards are written, they are often difficult to measure on assessments without first unpacking them. My second weakness was definitely with my assessments. I’ve written about my thoughts on this front in a previous post. This is a vital piece of the backward design model, but one that I often neglected. It is essential to criterion-based grading that the summative assessments are clearly and completely aligned to the KUDs of the unit. Otherwise, how can a teacher actually assign a grade to a student that refers to their proficiency on those criteria? My final main weakness was in my implementation of the other two Ps of grading: the process grade and the progress grade. First, my school didn’t really do anything with a progress grade, though there were some attempts at involving students in goal-setting, which I also didn’t do very effectively. Second, I didn’t pay much attention to the process grade. I didn’t have a clear articulation for myself regarding the criteria for this grade, and I didn’t communicate anything transparently about this grade to students. As a result, I more or less assigned these process grades at the last minute prior to a deadline for grades, and I did it more or less based on my general impressions and observations of the student. 
As a result of this approach, I didn’t take these process grades very seriously, nor did my students, and so they became just a bureaucratic entry in a grade book.

My current school still uses the traditional grading practices I described in the last post. Next year we’ll be embarking on a process of reforming those grading practices, but, at least for next year, I’m still stuck within the larger paradigm of traditional grading practices. I’ve given some thought to some work-arounds, however. For teachers who are in a similar situation and looking for ideas of how they might make reforms of their own practices, while trying to push towards reforms at the school level, I’ll post my thoughts on “hacking” the traditional grading paradigm in the next post. For other readers just looking for what all of the above might look like in practice, I think the next post could also be helpful.

Nathan Haines https://onteachingandlearning.com
