Comments from Brian
Cumulative calibration
Congratulations Akash on this.
Can we get our hands on an existing specification for a peer assessment system that could be used as a starting point - perhaps a detailed description of the Moodle one? I have used it and found it fairly good. The one thing I think could be usefully added is some measure of competence of reviewers.
Off the top of my head I can think of 3 measurements that would need to be kept for each reviewer.
1. A measure of their tendency to err on the high or low side. This could be as simple as a percentage or a number between -100 and +100 (when grading out of 100).
2. A measure of the variability of their assessments. I'm not a statistician so I'm not sure how this might be expressed, but the idea is to quantify their tendency to vary in their inaccuracy. A reviewer with a high error tendency but low variability is easier to adjust for than one with high variability.
3. Finally, some measure of our confidence in the above 2 measurements. Opportunities will constantly arise for modification of a learner's calibration scores above. As the calibration scores are tweaked, each new piece of information carries less weight. A way to handle that might be that each type of calibration has a particular score - e.g. I might be calibrated against 3 of my peers assessing the same work, with a confidence score of 5. A tutor, perhaps at random, might grade the same piece of work and my ability calibrated against that with a score of 10. I might have my score compared to another student who has already been calibrated by a tutor, and be calibrated accordingly, with a value of 3. After these 3 calibrations, the confidence of my calibration might be scored at 5 + 10 + 3 = 18. If another tutor then calibrates me, their calibration would be weighted against the existing one in a ratio of 18 to 10. Feedback on reviews might also be included. This is not well thought out but I hope you get the idea.
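A minimal sketch, in Python, of how these three measures might be stored and updated. The class name, the weighted-average update and the confidence weights dictionary are all assumptions for illustration; the weights simply mirror the example values above (peer = 5, tutor = 10, comparison against a calibrated peer = 3).

    from dataclasses import dataclass

    # Illustrative confidence weights taken from the example above (assumed values).
    CONFIDENCE_WEIGHTS = {"peer": 5, "tutor": 10, "calibrated_peer": 3}

    @dataclass
    class ReviewerCalibration:
        bias: float = 0.0         # tendency to err high (+) or low (-), on a -100..+100 scale
        variability: float = 0.0  # spread of the reviewer's errors around their bias
        confidence: float = 0.0   # accumulated confidence in the two measures above

        def update(self, reviewer_grade: float, reference_grade: float, source: str) -> None:
            """Blend a new observation into the calibration, weighted by accumulated confidence."""
            weight = CONFIDENCE_WEIGHTS[source]
            error = reviewer_grade - reference_grade
            total = self.confidence + weight
            # New evidence counts for weight/total of the blended value, so each
            # successive calibration carries progressively less weight.
            self.bias = (self.bias * self.confidence + error * weight) / total
            self.variability = (self.variability * self.confidence + abs(error - self.bias) * weight) / total
            self.confidence = total

With these weights, the three calibrations in the example accumulate a confidence of 18, and a further tutor calibration is blended against the existing value in a ratio of 18 to 10.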
In terms of the algorithm (and I think I've mentioned this before) I think it would be efficient if a tutor could grade an assignment and the learners who assessed the same piece of work could be calibrated from this (added confidence 5?). Then their grades of other assignments would be compared with those of other learners and their ability calibrated (confidence 3?) - this could then be repeated, and the next calibration might have a confidence addition of 1. I hope you get the idea.
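A rough sketch of how that propagation could run, reusing the hypothetical ReviewerCalibration class above. The data structures (a dict of reviews per submission and a dict of calibrations per reviewer) are assumptions, and which confidence weight each wave should carry is still an open question; the sketch simply reuses the keys of the earlier dictionary.

    def propagate_from_tutor_grade(submission_id, tutor_grade, reviews, calibrations):
        """reviews: submission_id -> list of (reviewer_id, grade);
        calibrations: reviewer_id -> ReviewerCalibration."""
        # First wave: learners who assessed the tutor-marked submission.
        first_wave = []
        for reviewer_id, grade in reviews.get(submission_id, []):
            calibrations[reviewer_id].update(grade, tutor_grade, source="tutor")
            first_wave.append(reviewer_id)
        # Second wave: on other submissions, compare remaining reviewers against
        # the average grade given by reviewers calibrated in the first wave.
        for other_id, graded in reviews.items():
            if other_id == submission_id:
                continue
            calibrated = [(r, g) for r, g in graded if r in first_wave]
            if not calibrated:
                continue
            reference = sum(g for _, g in calibrated) / len(calibrated)
            for reviewer_id, grade in graded:
                if reviewer_id not in first_wave:
                    calibrations[reviewer_id].update(grade, reference, source="calibrated_peer")

Further waves at lower confidence could be added by repeating the second step with the newly calibrated reviewers.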
The idea of assigning a confidence measure to calibration would work best over many assignments, so it may be necessary to have a mechanism for transferring calibration data between courses.
You would have to be sure that the algorithm did not end up doing silly things like getting into a recursive loop, particularly with positive (or negative?) feedback.
I can immediately think of a silly outcome where a learner could end up with a score greater than a tutor's. Maybe tutors have a score of 80 and learners move asymptotically towards that score. (Perhaps we will be able to prove that some learners can be more reliable than tutors - now there's a challenge - have some measure of the ability of tutors built in as well.)
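One way to express the asymptotic behaviour: a learner's reliability score could close a fraction of the remaining gap to the tutor ceiling on each successful calibration, so it approaches but never exceeds it. The ceiling of 80 comes from the example above; the step fraction and function name are assumptions.

    TUTOR_CEILING = 80.0
    STEP = 0.2  # assumed fraction of the remaining gap closed per calibration

    def updated_reliability(current: float, agreement: float) -> float:
        """Move a learner's reliability toward the tutor ceiling; agreement is in 0..1."""
        return current + STEP * agreement * (TUTOR_CEILING - current)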
Apologies for the stream of consciousness. I do believe that peer assessment will eventually prove to be the most powerful tool in our arsenal for cutting the cost of accredited education. I may be wrong.
Brian
Comments from Mika in response to Brian
Evaluation reliability
To add to Brian's ideas: if you have a clear rubric for evaluators to follow, you'll improve the reliability. You could even eventually build a library, for each task, of sample work exemplifying different levels of rating, so that everyone can calibrate themselves. But if all you do is ask people to evaluate on a 10-point scale, you're going to get a lot of variability.
Mika
Response to Mika's comments
Agreed -- the system will need to cater for clear rubrics with criteria and specifications of what is required for each grade level to improve reliability. I also think using broad bands of performance, e.g. Unsatisfactory = 1 - 4, Acceptable = 5 - 6 and Excellent = 7 - 10, would mitigate some of these challenges. There is always an element of subjectivity when humans are expressing value judgements, even in rigorous systems. I also think that it's pedagogically sound to incorporate a self-evaluation. Should the system detect significant deviations of the peer evaluations when compared with the self-evaluation, these evaluations could be flagged by the system.
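A minimal sketch of the broad bands and the self/peer deviation flag described above; the band boundaries follow the example given, while the deviation threshold of 2 points and the function names are assumptions.

    def band(score: int) -> str:
        """Map a 10-point score onto the broad performance bands."""
        if score <= 4:
            return "Unsatisfactory"
        if score <= 6:
            return "Acceptable"
        return "Excellent"

    def flag_for_moderation(self_score: int, peer_scores: list) -> bool:
        """Flag a submission when the peer average deviates markedly from the self-evaluation."""
        peer_average = sum(peer_scores) / len(peer_scores)
        return abs(peer_average - self_score) > 2  # assumed threshold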
Also, in the case of using peer-evaluation to assist with scaling feedback for formative assessment it is possible to focus on more objectively verifiable criteria, for example "did the post meet the minimum word count" (taking into account that we could automate this kind of criterion later in the process) or "did the post respond to the three questions".
In the OCL4Ed course, for instance, we have specified a minimum number of substantive blog posts in order to qualify for certification of participation. Sometimes the learners submit the wrong URL for their posts. At this time checking the URLs is a manual process and is not scalable. Peer evaluation could assist in scaling the implementation of this kind of participation metric.
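As an illustration of automating such an objectively verifiable criterion, here is a minimal sketch that fetches a submitted blog post URL and checks it against a minimum word count. The 200-word threshold and the crude tag stripping are assumptions, not part of the OCL4Ed specification.

    import re
    import urllib.request

    MIN_WORDS = 200  # assumed threshold

    def meets_minimum_word_count(url: str) -> bool:
        """Fetch a blog post and check whether its visible text meets the minimum word count."""
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read().decode("utf-8", errors="ignore")
        text = re.sub(r"<[^>]+>", " ", html)  # crude removal of HTML tags
        return len(text.split()) >= MIN_WORDS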
Response to Brian's feedback
Brian - thanks for that feedback - valuable ideas and foundations we need to incorporate into the design of the first step in the project.
An important facet of this opportunity for comment is to gain sufficient information to help Akash develop specifications for the first prototype. We follow an incremental design approach at the OER Foundation, whereby we focus on small steps of implementable code and learn by doing.
The idea of developing mathematical models to determine the "confidence of the reviewer" over time is important, similar to Jim's concept of developing a karma system.
Once we get to the point of implementing a specification for the first prototype, I would like to make a call to OERu partners to identify one or two statisticians or applied mathematicians who could advise on decisions for the mathematical model for the first iteration.
We will also need to be realistic - the GSoC project only allows for 3 months of coding and it's unlikely that we are going to be able to build a solution which addresses all the complexities associated with peer evaluation systems. However, I guess that we will be able to build something useful as a first step towards the next iteration.
A colleague just sent me a link to this paper which is relevant: "Tuned Models of Peer Assessment in MOOCs". I have not got around to reading it yet, but thought it would be useful, so I'm posting it immediately: http://www.stanford.edu/~cpiech/bio/papers/tuningPeerGrading.pdf
Authors: Chris Piech (Stanford University), Jonathan Huang (Stanford University), Zhenghao Chen (Coursera), Chuong Do (Coursera), Andrew Ng (Coursera), Daphne Koller (Coursera)
Brian
Thanks for posting the link to that paper!
As a technician, I think the paper suggests several elements that need consideration:
- having students review submissions that have been reviewed by "experts" (ground truth), which is a variation on Mika's comment about a library of sample works
- partitioning reviewers by native language in an attempt to remove that bias
- recording "time spent grading" a submission is challenging in a distributed environment like the OERu courses that have been offered to date
- (Their "sweet spot" of 20 minutes spent grading an assignment sounds like a significant time commitment for our mOOC assignments.)
- if karma is used, it may be necessary to factor in the marks an evaluator has received, not just those he has given (and had commented on)
- a large discrepancy in scores might signal the need to add additional reviewers for a particular submission (see the sketch after this list)
- how to present scores in a meaningful way, especially if different weights are being applied, or some evaluations are discarded, etc., in an environment where individual evaluations are open
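As a sketch of the discrepancy point above: if the spread of peer scores for a submission exceeds a threshold, the system could automatically request another reviewer. The threshold of 15 points (out of 100) and the function name are assumptions.

    from statistics import pstdev

    DISCREPANCY_THRESHOLD = 15.0  # assumed spread, in points out of 100

    def needs_additional_reviewer(scores: list) -> bool:
        """Return True when peer scores disagree enough to warrant assigning another reviewer."""
        return len(scores) >= 2 and pstdev(scores) > DISCREPANCY_THRESHOLD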