Problems with Traditional Grading Practices

Grading

Jun 11

Antique school report card from Arlington College, circa 1899. Image from the Wikipedia entry “Report Card.”

It's been months since I posted anything to this blog, but I'd like to jump right back into my obsession with grading and assessment practices. In my last post, I described how I've long been frustrated with my grading practices as a teacher, but discovered that, in rethinking grading, I first had to rethink the way I approached student assessment and my assessment data collection practices. In this post, I'd like to finally dive into the actual practices of grading. I've found myself teaching at a new school this semester, where traditional grading practices are still the norm (though, to the school's credit, it has embarked on an assessment and grading reform plan). My daughter has also found herself navigating traditional grading practices at her new high school this year. As a result, issues around grading have continued to occupy my thinking. I'd like to start with a critique of traditional grading practices. I should note that these are traditional grading practices in an American education context, though they have influenced grading practices in K-12 schools in other parts of the world as well. I will also note that, though not identical, these practices were very similar to those used in my Canadian schools growing up. Of course, that does not mean they're universal.

By traditional grading practices I mean the system of grading that many of us adults are familiar with from our own school experience, and that is still common in many schools today. This typically involves a teacher assigning, collecting and scoring various student assignments or assessments, including tests, quizzes, projects, homework assignments, etc., and then averaging all of those scores together for a final percentage grade. Depending on the teacher, sometimes each assignment is given a raw score out of a certain number of points, then the raw score is converted into a percentage, and all the percentages are averaged together. In other cases, a teacher may record only raw scores for assignments throughout a grading term and then convert the ratio of total-points-earned to total-points-possible into a percentage grade at the end of the grading term. Teachers often create weighted categories for different types of assignments, so that some are worth more than others in the final percentage calculation. For example, a teacher may make tests worth 40%, projects 40% and quizzes 20%, or use some other combination of categories and weights. Some teachers may accomplish the same thing simply through assignment raw score point totals. For example, a teacher may design tests out of 40 points, and quizzes out of 10 points, thus making quizzes one-fourth the value of tests.
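To make the arithmetic concrete, here is a minimal Python sketch of the calculation methods described above. The student scores and the function names are invented for illustration; they are not taken from any real grade book or grading software.

```python
def percent(earned, possible):
    """Convert a raw score to a percentage."""
    return 100.0 * earned / possible


def average_of_percentages(scores):
    """Convert each (earned, possible) score to a percentage, then average."""
    percentages = [percent(e, p) for e, p in scores]
    return sum(percentages) / len(percentages)


def total_points_percentage(scores):
    """Divide total points earned by total points possible."""
    earned = sum(e for e, _ in scores)
    possible = sum(p for _, p in scores)
    return percent(earned, possible)


def weighted_category_grade(by_category, weights):
    """Average each category separately, then combine using category weights."""
    return sum(
        weights[name] * average_of_percentages(by_category[name])
        for name in weights
    )


# Hypothetical student: one test out of 40 points, two quizzes out of 10.
scores = [(30, 40), (9, 10), (10, 10)]
print(f"{average_of_percentages(scores):.1f}")   # 88.3
print(f"{total_points_percentage(scores):.1f}")  # 81.7

# The 40/40/20 weighting from the text, with one invented project score added.
by_category = {
    "tests": [(30, 40)],
    "projects": [(45, 50)],
    "quizzes": [(9, 10), (10, 10)],
}
weights = {"tests": 0.40, "projects": 0.40, "quizzes": 0.20}
print(f"{weighted_category_grade(by_category, weights):.1f}")  # 85.0
```

The same three scores produce a different final percentage depending on which method the teacher happens to use.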

However teachers set this up, in the end all assessment scores or grades are averaged together for a final percentage grade and then letter grades are assigned to percentage ranges. A typical scale would involve 90%+ as an A, 80-89% as a B, 70-79% as a C, and 60-69% as a D. Some grading scales are then further subdivided with +/- grades so that, for example, percentage grades in the low 80s are a B-, mid-80s are a B and upper 80s are a B+. On a typical traditional grading scale, 60% is the pass mark, so anything below 60% receives an F for "Failing." Just to add another step to this traditional grading process, schools often calculate a Grade Point Average (GPA) in order to show an average grade across all of a student's classes. Most GPAs are calculated on a 4-point scale with students receiving 1 point for each class grade of D, 2 points for each C, 3 points for each B, and 4 points for each A. These points are averaged together and usually run out to a couple of decimal places to show a student's overall average grade for a semester, or for the student's cumulative high school career.
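The scale and GPA calculation just described can be sketched in a few lines of Python. This is only an illustration: the class percentages are invented, the +/- subdivisions are left out, and counting an F as 0 points is an assumption (the common convention, but not stated above).

```python
# Points per letter grade on a 4-point scale; F = 0 is an assumed convention.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def letter_grade(pct):
    """Map a final percentage to a letter using the typical 90/80/70/60 scale."""
    if pct >= 90:
        return "A"
    if pct >= 80:
        return "B"
    if pct >= 70:
        return "C"
    if pct >= 60:
        return "D"
    return "F"


def gpa(percentages):
    """Average the grade points earned across a student's classes."""
    letters = [letter_grade(p) for p in percentages]
    return sum(GRADE_POINTS[letter] for letter in letters) / len(letters)


# Hypothetical final percentages for four classes.
class_percentages = [92, 85, 78, 64]
print([letter_grade(p) for p in class_percentages])  # ['A', 'B', 'C', 'D']
print(f"{gpa(class_percentages):.2f}")               # 2.50
```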

All of this algorithmic calculating provides a veneer of mathematical objectivity when it comes to final grades for a class. Teachers and schools appreciate this sense of objectivity in grading. If a student or a parent questions a grade, for example, a teacher can simply pull out the grade book, crunch the numbers, and say, "look, it's just the math." There's the belief that this eliminates any problem of teacher bias in final grades. Teachers are even further removed from the process with electronic grade books, which do all the calculating for them. The problem, though, is that this sense of mathematical objectivity is a deception, one that most teachers know only too well. Teachers and schools keep up the deception to avoid the awkward conversations with students and parents about what really "goes into the sausage," so to speak, when it comes to grades.

When it comes to traditional grading practices, teachers do all sorts of things that result in final grades that are far from objective mathematical calculations. A teacher may offer optional extra credit, so that some students receive points that others don't receive. A teacher may provide some students with opportunities for additional assignments, or retakes on a test. A teacher may drop a student's lowest score. A teacher may create a project or other type of assessment that is intentionally quite easy in order to help students "boost" their grade. A teacher may adjust the weights of assignments mid-semester. A teacher may "curve" a test on which students performed particularly poorly, which basically means adjusting the scale of the test after the fact. A teacher may have a weighted category called "participation," which serves as a little cushion for the grades of students who need it. Many teachers recognize that giving a zero to a student on a missed assignment has an overly punitive effect on a final average percentage grade, and so they may change those zero scores to something like 50% before the final grades are calculated. Teachers may round percentage grades up or down, particularly when a student's final grade is close to a cut-off point between two letter grades. I'm not arguing that any of these practices are necessarily bad; I'm simply pointing out that the process of coming to a final percentage grade in traditional grading practices is far from a simple and objective mathematical calculation, and, though teachers might not want to admit it, much of the final grade remains within a teacher's professional judgment of what an appropriate grade should be for that student.

Furthermore, the traditional grading process I've described above involves averaging scores or grades to get a final percentage grade. By averaging, I'm referring to determining the arithmetic mean. Anyone who's taken a statistics course knows that the arithmetic mean is not always the best measure of central tendency for a set of data. For a set of data that is not normally distributed, the mean is very sensitive to the tail of the data, and especially sensitive to outliers. In these instances, the median, or even the mode, is often a better representation of the data as a whole. An individual student's set of grades for a grading term is not likely to be normally distributed. This would require a student to have mostly Cs, then a lesser, but equal number of Bs and Ds and then a still lesser, but equal number of As and Fs. But that would be a pretty rare data set for a student's individual grades over a grading term. Something much more typical would be a student with three As (let's say 96%, 95%, 94%), four Bs (89%, 87%, 86%, 84%), two Cs (77%, 75%) and one F (let's say it was a really bad test that resulted in 40%). In this skewed set of data, the single low score pulls the mean below the median. Specifically, the mean would be 82.3% (a B-), while the median would be 86.5%, which rounds to 87% (a B+). So which is the better overall representation of this set of data? Since the point, in this case, is to provide a measure of central tendency that best represents an overall grade for this student, when I step back and look at the data, I'm compelled to say that the median is probably the better representation. My argument, however, is not that we could reform grading practices merely by replacing the mean with the median for final grades. My point is to further problematize the perception of mathematical objectivity in traditional grading practices.
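For anyone who wants to check the numbers, here is a short Python sketch using the standard library's statistics module on the ten example scores above. The variable names are just for illustration.

```python
from statistics import mean, median

# The ten example scores: three As, four Bs, two Cs and one F.
scores = [96, 95, 94, 89, 87, 86, 84, 77, 75, 40]

print(mean(scores))    # 82.3
print(median(scores))  # 86.5

# Leaving out the single outlier shows how much it alone pulls the mean down.
without_outlier = [s for s in scores if s != 40]
print(mean(without_outlier))    # 87
print(median(without_outlier))  # 87
```

With the one bad test removed, the mean and the median agree, which suggests that the gap between them comes almost entirely from that single score.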

Speaking of the problem of averaging and the sensitivity of the mean to outliers in a data set, I must take a moment and address the problem of giving grades of zeros. This has been a debate among teachers for some time. Some argue that a grade of a zero (for missing assignments, for example) is unfair to a student because of the extreme outlier nature of a zero within a data set of grades that are averaged together using the arithmetic mean. For example, a student with nine straight assignment grades of 85%, but then one zero grade for a missing assignment, will end up with a 76.5% final grade. That one assignment, because it's such an extreme outlier, has significant influence -- unfair influence -- over the mean. As a result, some teachers in contexts of traditional grading practices input nothing less than a 50% in their grade books. They argue that the percentage range for each of the other letter grades is 10 percentage points, so that should also be the case for an F. In other words, 50% becomes the grading floor, making 50-59% the range for an F. While this logic makes a lot of statistical sense, it's also a hard pill to swallow for some teachers because, viewed from a different perspective, it amounts to giving students half the points for something that they didn't do; it's giving them credit for nothing. I would argue that this debate somewhat misses the point. The problem is less about whether or not giving a zero is fair, and more about the whole system of traditional grading in general. If we stopped using percentage grades and stopped calculating final grades by averaging, the problem of the zero would no longer be a problem at all.
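The effect of a single zero, and of a 50% grading floor, can be shown with a small Python sketch. The function name and the floor parameter are hypothetical, used only to illustrate the example above.

```python
from statistics import mean


def final_average(scores, floor=0):
    """Average the scores after raising anything below `floor` up to `floor`."""
    return mean([max(score, floor) for score in scores])


# Nine assignments at 85% and one missing assignment recorded as a zero.
scores = [85] * 9 + [0]

print(final_average(scores))            # 76.5
print(final_average(scores, floor=50))  # 81.5
```

With the floor in place, the missing assignment still lowers the grade, but by 3.5 percentage points rather than 8.5.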

Beyond the problem of perceived mathematical objectivity, there are actually more important issues with traditional grading; these go down to a more pedagogical level where we uncover some antiquated and troubling assumptions about teaching and learning more generally. The first pedagogical problem with traditional grading is the associated assumption that teaching is merely an act of transmitting a canon of knowledge from the teacher to the student. Learning, in this paradigm, is a matter of receiving that information from the teacher and demonstrating reception by remembering enough of it on the test. In this view, the teacher is the knower of knowledge and it's the teacher's job to deliver that knowledge to the student; teachers often refer to this in terms of "covering the content" of a course. Once the content has been covered, the responsibility lies with the student to "learn" it, by which one means that they must study their notes and remember the content for the test.

This paradigm of teaching and learning has implications for grading. The teacher anticipates a few really attentive, hard-working and intelligent students in the class, who will engage fully in this transmission of knowledge, take copious notes, memorize those notes, and demonstrate high levels of remembering on the test. In fact, the teacher expects a few of these students to remember 90% or more of the content for the test and thus score As. Students who do this consistently will receive an A in the course, representing that they "know" at least 90% of the course content. Meanwhile, the teacher anticipates a few very inattentive and lazy students. They will tune out much of the teacher's delivery of content, take limited notes, spend little time memorizing those notes, and demonstrate very low levels of remembering on the test. The teacher will assume that these students had limited academic potential to begin with. These students will end up failing the test, and if they do that consistently, they will fail the course. The teacher then expects that most of the students in the class will fall somewhere in between these two ends of the spectrum. What constitutes failing? Being unable to remember at least 60% of the content. Why is 60% considered the cut-off between a passing grade of D and a failing grade of F? Because, at some point, given this traditional transmission of knowledge pedagogical paradigm, some educators decided that, in order to receive a passing grade and receive credit for "knowing" the content, students had to demonstrate that they knew -- by which these educators meant "remembered" -- at least 60% of the content of the course. There's nothing magical about this 60% cut-off mark. When I went to high school in Canada in the mid-'90s, the passing cut-off was 50%. We didn't use letter grades, but rather left grades just as percentages, so someone within the provincial department of education decided that remembering at least 50% of the content of a course was sufficient for passing. The point is, these are arbitrary scales rooted in a content coverage paradigm of teaching and learning.

Even many very traditional teachers and schools today would argue that pedagogy is about more than covering content and remembering information. Many educators would point out that teachers must help learners come to understand big ideas or concepts and develop skills. Some would argue that knowledge must be constructed in the mind of the learner, or that learning requires the integration of new information with the already existing schema of the learner's brain. Teachers might advocate for the importance of students learning how to learn, or they might talk about developing lifelong learning competencies, or about learning for transfer, rather than just memory recall for a test. They may also discuss the importance of developing critical and creative thinking skills, and collaborative aptitudes. All of these would be strong, pedagogically sound arguments with which I would agree. The problem is that many of these same educators will then turn around and apply a grading practice that's still stuck in the old paradigm of measuring the percentage of covered content the student can remember.

For the second pedagogical-level problem with traditional grading, I have to return to some statistics and data distribution curves. This may become a little confusing, because above I pointed out that the data set of an individual student's grades over a grading period is unlikely to follow a normal distribution curve. That's true. However, traditional grading practices do assume that within a typical class of students, the academic performance within that class will be, roughly speaking, normally distributed. If I break down what I mean by this, I suspect that most teachers, and probably most adults generally, will find that this aligns with their assumptions about classes of students. What would a class look like if the student grades within the class followed a normal distribution? Well, there'd be a few very smart, hard-working students who would achieve As, while there would also be a few lazy students with limited academic potential who would end up with Fs. All of the other students would fall somewhere between these two ends of the spectrum; there'd be a cluster of students in the D range and probably a similarly sized cluster in the B range, and then the largest concentration of students would fall in the C range. For most people this break-down of a typical class of students just makes common sense. In fact, we commonly refer to the idea of a "C-student," by which we mean a middle-of-the-pack, typical sort of student. They're someone who doesn't stand out as a particularly great student, nor as a particularly poor student. They sit simply in the center of the data set; they're average.

This brings up the issue of bell curves and grading on a curve. From my observation, most teachers are uncomfortable with the idea of grading based on a bell curve. In fact, the idea of a bell curve is taboo for most K-12 teachers. Formally, grading on a bell curve involves taking the raw scores of a class after an assignment or test has been scored, fitting those scores to a normal distribution curve, and then assigning letter grades based on where the scores fit on the curve. Those few on the tail on the right are given As, those few on the tail on the left are given Fs, those clustered around the middle receive Cs, those roughly one standard deviation above the mean get Bs and those roughly one standard deviation below the mean get Ds. I don't think many K-12 teachers actually grade this way; in fact, most would be repelled by the idea of it. However, the underlying assumption of a bell curve still remains in traditional grading practices. Teachers and schools still expect that the bulk of students in a class will be average students and receive Cs. There will be some above average students, some below average students, a few exceptional A students, and, finally, a few hopeless F students. In other words, regardless of whether or not they want to admit it, teachers and schools using traditional grading practices are indeed grading on a bell curve.
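As a rough illustration of formal curve grading, here is a minimal Python sketch that assigns letters by how many standard deviations each score sits from the class mean. The z-score cut-offs (1.5 and 0.5 on either side of the mean) and the two sets of class scores are assumptions chosen for the example; actual curving schemes vary.

```python
from statistics import mean, pstdev

# Assumed z-score cut-offs, checked from highest to lowest.
Z_CUTOFFS = [(1.5, "A"), (0.5, "B"), (-0.5, "C"), (-1.5, "D")]


def curve_grades(raw_scores):
    """Assign letters by each score's distance from the class mean."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)
    grades = []
    for score in raw_scores:
        # If every score is identical, treat everyone as exactly average.
        z = (score - mu) / sigma if sigma else 0.0
        for cutoff, letter in Z_CUTOFFS:
            if z >= cutoff:
                grades.append(letter)
                break
        else:
            grades.append("F")
    return grades


# Two hypothetical classes that each contain a raw score of 80.
print(curve_grades([50, 60, 70, 80, 90]))   # ['D', 'D', 'C', 'B', 'B']
print(curve_grades([80, 90, 95, 98, 100]))  # ['F', 'C', 'C', 'B', 'B']
```

The same raw score of 80 earns a B in the first class and an F in the second, because on a curve the letter depends on classmates' scores rather than on any fixed criterion.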

Discussions of grading and bell curves in education conjure up controversies around IQ tests and racial and cultural bias in those tests. Let me be clear that I don't think human intelligence can be defined as a single trait, nor do I think that it can be objectively measured in some culturally neutral way. However, for the sake of argument, let's just assume for a moment that intelligence can be measured objectively, and that, when it is, the data set is normally distributed. Even if this were actually the case (again, just for the sake of argument), this should still not correspond to a normal, bell-shaped distribution of student grades in a class. Normal distributions of data happen with random, naturally occurring, independent events. Learning in a classroom should be anything but. If a student's academic achievement in a classroom is based purely on some naturally occurring trait in the student, and completely independent of anything the teacher does, then what's the point of the teacher?

This ties back to my previous point critiquing the assumption that teaching is merely an act of covering content and transmitting information. If that were the sole role of a teacher, then one might be able to make the case that student grades are based on random, naturally occurring events, independent of teacher action in the classroom, but I believe that most teachers today have a more sophisticated view of pedagogy. Most teachers believe, at least on some level, that the job that they do in the classroom makes a difference; they believe that the instructional decisions that they make, the resources and materials they develop, the motivation and inspiration they muster, the scaffolding and supports they provide -- that all of these things can positively impact student learning. If we believe that a teacher can positively impact student learning in the classroom, that the act of teaching can make an intervening difference in levels of student learning, then it would be faulty logic to also believe that measurement data of student learning in that classroom would distribute along a normal curve. So why do we continue to use a grading system that is based on the assumption of a normal distribution curve?

Put another way, the problem with traditional grading practices is that they're not actually about measuring what students have learned; rather, they're about sorting students by comparing them with each other. Traditional grading practices are norm-referenced grading practices. There are no determined criteria against which students are measured; instead, students are compared and ranked against each other. A traditional grade fails to communicate to anyone what the student actually knows, understands or is able to do; it merely communicates how that student ranked in comparison with his or her peers within a specific class or within that school. College admissions departments know this. What does a 3.8 GPA mean for a specific student on a college application? Well, it means that they did relatively well vis-à-vis their peers in their specific high school, but how does that compare to another student with a 3.8 GPA from another school? Who knows? And this is why college admissions departments look at so much more than high school GPAs.

I think the argument that I've laid out is a pretty damning indictment of traditional grading practices. Despite appearances, these grades are not actually mathematically objective. Even if they were, the practice of using the arithmetic mean for calculating a student's final grades is problematic. More importantly, traditional grades are rooted in an outdated pedagogical paradigm in which teaching and learning are merely about information transfer and remembering, and they're a norm-referenced measurement scale designed to compare and sort students within their peer group, rather than to communicate what the student knows, understands and is able to do. At this point, one might argue that any attempt to measure human cognitive processes like learning will be flawed and limited and so we just have to make do. To the first point of that argument, that there will always be limitations when it comes to measuring student learning, I will agree. However, there are far less limited and less flawed methods for doing so. I'll be discussing these in a future post; for now, I will simply conclude that I think it's time that the traditional grading practices that I've described above be retired completely. Not only are they limited and flawed; I think they're actually harmful to student learning.

Nathan Haines https://onteachingandlearning.com
