Student Assessment as Teacher Research
Image generated by ChatGPT.
I mentioned in a previous post that my long-standing hatred, as a teacher, of the task of grading has recently spurred me to start looking into ways it can be done better. Not only have I always found grading student assessment work to be terribly boring, but I've also struggled with its very purpose. Of course, I understand the stated purpose of grading: it's about evaluating student performance in school, putting some sort of measure on that performance, and holding students accountable for school expectations. How will the students themselves, parents, subsequent teachers, and -- eventually -- university admissions officers know the school aptitudes of students without grades, right?
There's a problem, though. I have found that, based on the way they're typically used in schools, grades don't actually accomplish those purposes. As much as they hold students accountable as a form of incentive, they just as often disincentivize and demotivate students. Though they're intended as a measurement tool of student abilities, they're more often used as a social sorting mechanism, or as a carrot or stick to manipulate student behavior. Rather than communicate something meaningful about a student's academic achievement, grades just as often represent level of effort, or degree of compliance with a teacher's rules. Though the numbers, averages, percentages, and algorithms that teachers (or grading software) calculate to determine grades give an impression of mathematical objectivity and precision, in truth, grades are very subjective and plagued with bias. Paul Dressel, an educational psychologist at Michigan State University for many years, once summed up what I've long felt about grades: "A grade is an inadequate report of an imprecise judgement of a biased and variable judge of the extent to which a student has attained an undefined level of mastery of an unknown proportion of an indefinite amount of material." No wonder I hated grading and often felt downright "dirty" after submitting grades for report cards every quarter.
For a number of years I've taught the IB Business Management course. One of the units in that course is about marketing and there's a topic within the marketing unit related to market research. Though this is a relatively minor curricular component in the course, I have decided it's worth the investment of additional time. I have found it an important opportunity to explore social science research methods more generally, and to help students understand and think critically about research and data collection methodology when it comes to social science knowledge creation (it's also a great place for TOK connections in the IB Business Management curriculum). It's often very difficult for students to recognize why attention to primary data collection methodology matters. Students will put together a poorly worded survey of five questions, seeking information on a poorly focused and weakly targeted topic, email it out to a few of their friends, read through the results to see confirmation of some inclinations that they already had, and consider that they've discovered or confirmed some important knowledge about the world. I don't wish to be hard only on my students here; I think, especially in our polarized, echo-chamber, confirmation-bias-prone world, most of us adults do the same thing all the time.
In preparing to teach this topic of social science research methods last year, some questions occurred to me. To what extent is the job of the classroom teacher, when it comes to student assessment, the same as that of the researcher doing primary data collection? As a teacher, when I create, assign, collect and analyze student assessments, am I not collecting primary data? Am I not collecting it for the purpose of gaining some sort of information about my students, such as how much they've learned of a given topic? These questions then led me to some more troubling ones. How strong was my data collection methodology when it came to my assessment practices? What if someone truly trained and experienced in social science research methodology -- qualitative, quantitative, surveys, interviews, sampling, statistical analysis, etc. -- reviewed the methods and measures I used to arrive at student grades? Well, I expect they'd be appalled, and I suspect the same would be true if they reviewed the assessment and grading practices of many teachers. This brought me to a realization: my problem with grading starts before any assigning or calculating of letters, numbers or percentages. I have to start my inquiry into better grading practices with a look at my assessment data collection practices. Am I gathering the right data, and the right amount of data? Am I using the right methods to collect that data? Am I considering the best sampling techniques? All of this has to come before any assigning of a score, letter grade or rubric category. And so, though I've ranted about grades at the outset of this post, I'm not actually going to address grades yet; instead I'm going to look at classroom assessment practices from the perspective of research data collection methodology. I'm going to argue that teachers are researchers of student learning in their classrooms, and that they therefore need to approach student assessment data using best-practice research methodology.
As a short side note, I must also say that this idea of teachers as researchers is very appealing to me. The part of me that wants to champion teacher leadership, teacher agency and teacher efficacy also wants to embrace this idea that teachers don't just consume and transmit knowledge, but that teachers can be creators of new knowledge. I suspect there will be a future blog post coming about teachers as educational and pedagogical practitioner-researchers. I know that gathering and making meaning of student learning data is not exactly the same thing as creating new knowledge of educational and pedagogical theory and practice, but it still aligns with my vision of the classroom teacher-researcher.
My hunch about teachers playing the role of researchers when it comes to student assessment was validated by some reading about assessment I did this semester. Wiggins and McTighe, for example, in their work on backward unit and curriculum design, which they call "Understanding by Design" (UbD), talk a lot about assessment validity. Validity is an important term in research methodology as well. The question of validity in research is whether or not the data collection or measurement instrument actually gathered or measured what was intended. For Wiggins and McTighe, this is why "stage 2" of their unit design, which is all about summative assessment of learning, must follow "stage 1" of the unit design, which is all about the learning standards and objectives of the unit. In other words, stage 1 sets out what students will learn in the unit, and stage 2 sets out how students will be assessed on what they learn in the unit; stage 1 establishes the criteria by which student achievement levels will be determined using the stage 2 assessments. In order for the assessments to be valid, they have to be aligned to the stage 1 standards and objectives; they have to actually collect data that will measure the extent to which students achieved the stage 1 standards and objectives.
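To make that alignment idea a little more concrete, here is a minimal sketch of how a teacher might audit a unit for validity in the Wiggins and McTighe sense: check that every stage 2 assessment item maps back to a stage 1 objective, and that no objective goes unassessed. The objective codes and question labels below are entirely hypothetical, not taken from UbD or from any real unit.

```python
# Hypothetical example: stage 1 objectives and stage 2 assessment items for a unit.
objectives = {"MKT.1", "MKT.2", "MKT.3"}   # stage 1: what students should learn

# Stage 2: the objective each assessment question is intended to measure.
question_alignment = {
    "Q1": "MKT.1",
    "Q2": "MKT.1",
    "Q3": "MKT.3",
    "Q4": "EXTRA",   # a question that doesn't measure anything in stage 1
}

unassessed = objectives - set(question_alignment.values())
unaligned = [q for q, obj in question_alignment.items() if obj not in objectives]

print("Objectives with no assessment item:", unassessed or "none")
print("Questions not aligned to any objective:", unaligned or "none")
```

Even this toy check surfaces the two classic validity problems: an objective nothing on the assessment actually measures, and a question that measures something never established as a learning goal.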
This link between student assessment and research data collection methodology became even tighter in my mind when reading Tomlinson and Moon's book, Assessment and Student Success in the Differentiated Classroom (2013). They too discuss the need for assessment validity and they go further to identify three other data collection concepts that are important to student assessment data: reliability, error and bias. Without getting too deep in the weeds when it comes to research data collection methodology (on which I'm no expert anyway), let me briefly define these research concepts and how they apply to assessment data for the classroom teacher.
In research, error refers to some kind of problem in the data because of a faulty or poorly constructed data collection or measurement tool. In a numerical or statistical sense, it refers to the gap between the value arrived at from the data and the real value. Common errors could be due to poorly worded survey questions that were misunderstood by respondents, or a poorly constructed sample so that the data didn't accurately represent the whole population. Error could also be due to variables that weren't considered or controlled. Similarly, in student assessment, error represents the gap between what the student knows and is able to do as reflected in the data, and what the student really knows and is able to do. Tomlinson and Moon (2013) list a number of causes of error in student assessment, including poorly worded questions, a student's misunderstanding of directions, a student's lack of fluency in the language of the assessment, or a student learning disability or attention issue. Error can also arise because the student didn't have time to complete the assessment, or the student simply wasn't feeling well that day. All of these issues can lead to a gap between the results from the data and what the student truly knows and is able to do.
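A small worked example may help make that gap concrete. The sketch below uses entirely hypothetical numbers (not drawn from Tomlinson and Moon) and the classical test theory idea that an observed score is a true score plus an error term; it shows why any single assessment can land some distance from what a student actually knows, while several assessments taken together tend to land closer.

```python
import random

random.seed(1)

# Hypothetical numbers: classical test theory treats an observed score as the
# student's "true" level of mastery plus an error term: observed = true + error.
true_score = 82      # the student's actual mastery (unknowable in practice)
error_spread = 8     # how far a single assessment can swing due to error sources

def observed_score():
    """Simulate one assessment: the true score distorted by random error
    (ambiguous wording, misread directions, a bad day, and so on)."""
    return true_score + random.uniform(-error_spread, error_spread)

single = observed_score()
several = [observed_score() for _ in range(6)]
average = sum(several) / len(several)

print(f"One assessment:        {single:5.1f}  (gap from true score: {abs(single - true_score):.1f})")
print(f"Average of 6 attempts: {average:5.1f}  (gap from true score: {abs(average - true_score):.1f})")
```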
Bias in research data collection arises because of the limited or faulty assumptions, perspectives and prejudices of the researcher. Researchers are human beings, and all human beings experience the world subjectively, filtered through their own culture, experiences, and identities. While human bias is inevitable and perhaps cannot be entirely eliminated from any research process, it is important in research to acknowledge it and to use strategies to reduce it as much as possible. Collaboration, blind experiments, and peer review, for example, are strategies for reducing research bias. Regarding inherent teacher bias in student assessment, Tomlinson and Moon (2013) write, "Teacher bias happens because teachers are people whose feelings, experiences, and expectations come to work with them every day" (p. 125). Despite its inherent existence, teachers must work hard to reduce their bias by acknowledging it, triangulating data with multiple assessments, reviewing assessment data with colleagues, and ensuring that assessments have clearly communicated criteria and that analysis of the data is based on those criteria.
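One of those strategies, a kind of blind scoring, is simple enough to sketch. The snippet below is purely illustrative (made-up names, and dummy scores standing in for real teacher judgment): it separates student identities from their work before scoring, so judgments are recorded against anonymous IDs and only reconnected to names once all the scoring is finished.

```python
import random

random.seed(0)

# Hypothetical submissions: student name -> their written response.
submissions = {"Alice": "response A...", "Ben": "response B...", "Chao": "response C..."}

# Shuffle the names and replace them with anonymous IDs; the key stays set
# aside until all scoring is finished.
names = list(submissions)
random.shuffle(names)
key = {f"S{i + 1:03d}": name for i, name in enumerate(names)}
anonymized = {sid: submissions[name] for sid, name in key.items()}

# The teacher scores the anonymized work (dummy scores stand in for real judgment)...
scores = {sid: len(text) % 7 + 1 for sid, text in anonymized.items()}

# ...and only afterwards reconnects scores to students through the key.
scores_by_name = {key[sid]: score for sid, score in scores.items()}
print(scores_by_name)
```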
Reliability in research has to do with the consistency of results. For example, could a different researcher use the same tool and methodology and achieve the same results? From a more natural-science perspective, this is about replicability of the experiment and results; if you run the same experiment multiple times and get different results each time, the experiment has a reliability problem. Reliability often relates to sample design. Has the data collection sample been designed in such a way as to be generalizable to the whole population? In the case of student assessment data, when a teacher designs an assessment, they are creating a data collection sample. There is no way that a teacher can ask students to demonstrate what they know and can do in all possible scenarios. Instead, the teacher designs some sort of assessment that is a sampling of possible scenarios from which the teacher can generalize. A math teacher interested in assessing student ability to factor quadratic polynomials is not going to give students all the possible quadratic polynomials in the world to factor. Instead, they are going to give a sample of questions from which the teacher can generalize what the students know and can do. A teacher therefore must design an assessment that will allow for this generalization. Does the sample accurately reflect the whole? Teachers must also consider consistency of results across different sections of the same course, including sections taught by different teachers. Of course there will be differences, as each group of students is different, but if the learning criteria are the same, then the assessment tools should be the same (perhaps with different specific questions, but the overall design should be the same, with questions that get at the same knowledge and skills so that the samples consistently represent the whole), and the results should be relatively consistent. The matter of inter-rater reliability is also important. There should be consistency in the interpretation and scoring of the same assessment across teachers; there should be no difference in how data is interpreted, or in how questions are weighted or scored, simply based on which teacher is doing the scoring.
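A quick way to put a rough number on inter-rater reliability is to have two teachers score the same set of responses and compare. The sketch below uses made-up scores and simple agreement rates; a formal study would reach for a statistic like Cohen's kappa, but the underlying question is the same: do the scores depend on the work, or on the scorer?

```python
# Hypothetical rubric scores (1-7) given by two teachers to the same ten responses.
teacher_a = [5, 6, 4, 7, 3, 5, 6, 2, 4, 5]
teacher_b = [5, 5, 4, 6, 3, 5, 7, 3, 4, 5]

pairs = list(zip(teacher_a, teacher_b))
exact = sum(a == b for a, b in pairs) / len(pairs)
within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

print(f"Exact agreement:       {exact:.0%}")       # scores match exactly
print(f"Agreement within one:  {within_one:.0%}")  # a looser, adjacent-agreement check
```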
I am in no way arguing that attention to the above four research methods criteria will lead teachers to some utopian reality of perfection and objectivity in student assessment. Research and knowledge creation will always be filtered through human subjective experience of reality, and is further complicated in the social sciences given that humans, in all our complexity and subjectivity, are the objects of study. But these limitations by no means suggest that research, including data collection on student learning, is meaningless. Nor do they mean that it's pointless to attend to best practices in methodology. While the perfect student assessment tool may not exist, some assessments are definitely much better than others. Tomlinson and Moon (2013) argue that teachers should always work to, as much as possible, "increase reliability and validity and (...) reduce error and teacher bias" (p. 123).
Up to this point I have not addressed the fact that student assessment data is gathered by teachers to serve different purposes. Assessment experts often refer to two main categories of assessment: formative assessment and summative assessment. Summative assessment is sometimes described as "assessment of learning." These are the assessments given at the end of some period of learning for the purpose of determining what the student learned and attaching a score or grade to that level of student learning. These are end-point assessments that "sum up" the level of student learning for that topic or learning objective before moving on to something new. Formative assessment is sometimes described as "assessment for learning"; this is assessment data gathered by the teacher to help guide the next steps of classroom learning. These assessments are sometimes more informal, and they help to inform the teacher on the questions: Are the students getting this? Which students are ready to move on and which need some additional practice with this? Are my instructional methods working to help the students learn this? What adjustments do I need to make in my instructional plans to ensure that all students are progressing? In short, formative assessment is not about student grades; rather, it "informs" the teacher's decisions about next instructional steps. A sub-type of formative assessment is known as "pre-assessment," which is about gathering data on what students know and are able to do prior to a unit of learning. Pre-assessments provide baseline data about the starting points of each student going into the topic or learning objective.
While I'm not ready to address the issue of grading specifically in this post, I do want to briefly point out one problem with common teacher practices around grading. To use the words of Tomlinson and Moon (2013), teachers often "overgrade student work" (p. 130); teachers often feel compelled to assign a grade to everything a student does, including homework assignments, participation in class, practice assignments, worksheets and draft work. This tendency of over-grading conflates practice with performance. To use a sports metaphor, students need opportunities to practice, with the teacher providing judgment-free feedback and direction, so they can prepare for the game-time performances that matter. Grades, therefore, should only be attached to summative assessments, not to those that are formative. Formative assessment serves a different purpose: it is about informing the teacher's next instructional steps, while also providing students with feedback on their progress and on how they can continue to progress towards the learning objective.
Whether an assessment is formative or summative, however, quality assessment data collection methods are essential. While the purposes of formative and summative assessments differ, both must attend to error, bias, reliability and validity. Even when grades aren't a consideration, and even if the data collection is more limited in scope, more narrowly targeted and more informally implemented -- as is the case with formative assessment -- attention to error, bias, reliability and validity is vital. If the data is to be used to inform teacher decisions about next instructional steps so as to ensure that each student in the classroom progresses towards the learning objective, does this not demand that teachers use methods that ensure the highest levels of accuracy in the data being collected? As for summative assessment, it only makes sense that teachers will arrive at grades that more accurately represent student learning if those grades are derived from the scoring of assessments with low levels of error and bias, and high levels of reliability and validity. This doesn't solve all the problems with grades, but it's the required starting point; it's an insufficient condition, but it's a necessary one.
In short, whether it's about assigning grades, or about informing instructional decisions, teachers need to take seriously their data collection methods for student assessment, and I believe that a slight shift in teacher perspective can help to emphasize this. I believe it's important for teachers to understand their role as researchers of students. Teachers are curriculum writers, instructional strategists, student motivators, and so much more, but they are also researchers who are daily collecting and interpreting data on students and student learning.