CodeRunner: Some feedback from the front lines

Just want to post about my first use of CodeRunner for real. Our Intro to Object-Oriented Programming course went from two semesters to a condensed one-semester version to a rather intensive half-semester course. This pretty much precluded taking a week to mark each assignment (for 80 students), hence CR for rapid-return results. But something curious happened. In the past, students would get their marked and commented assignments back and that was that, as they knew there was no way to improve their mark by whining. With CR, however, a good mark is always possible, especially if one of their colleagues has already aced the assignment. This resulted in the marks being distributed as an upside-down Gaussian and an awful lot of cloned answers. Student cooperation evolved to where someone would sacrifice themselves on a question in order to get an acceptable answer, which then propagated to the others. They were certainly learning something, just not a lot of OOP. So finally I had to tell them that exercises would be marked, to give them feedback on how they were doing, but not included in their course grade. This seemed to cut down significantly on the panic to get a good mark by borrowing code.

Initially, I tried having hidden test cases, but this led to a great deal of frustration as the exercises got more difficult, and to more shared code whenever someone stumbled upon a test-passing answer. Finally I've settled on something like: adaptive mode; all-or-nothing grading with a 0,10,20,... penalty scheme; the first test case shown as an example; all test cases visible; hide-rest-if-fail checked for all tests. I have a sample answer for each question, but that's mostly for testing the test cases; I don't share that with students.
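To make the arithmetic of that setup concrete, here is a rough sketch of how a mark could come out under all-or-nothing grading with a 0,10,20,... penalty scheme. It assumes (this is my reading, not something stated above) that the k-th entry of the scheme is the total percentage deducted once k incorrect submissions have been made; the function names are made up for illustration and are not part of CodeRunner.

```python
def penalty_percent(wrong_submissions, first=0, step=10):
    """Percentage deducted after the given number of incorrect submissions.

    Assumption: the scheme 0,10,20,... is an arithmetic progression indexed
    by the count of wrong submissions, capped at 100.
    """
    if wrong_submissions <= 0:
        return 0
    return min(100, first + step * (wrong_submissions - 1))


def mark(tests_passed, tests_total, wrong_submissions):
    """All-or-nothing: any failing test gives 0, otherwise 1 minus the penalty."""
    if tests_passed < tests_total:
        return 0.0
    return (100 - penalty_percent(wrong_submissions)) / 100
```

Under these assumptions a student passing 6 of 7 tests scores 0, while one who passes everything after three wrong tries scores 0.8.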

Once the exercises get somewhat complex, just evaluating on whether the code gives the right answer is not really sufficient - I've gotten truly horrible code which passes all test cases. I'm currently working on getting a prototype template to run FindBugs for some proper static code analysis (maybe next week, DV & WP :^) ). I've already posted on some stuff I've cobbled up for testing things like replacing a loop with a forEach (Java8) but doing this case-by-case is really tedious. There's always a hole somewhere that pops up as soon as students get loose on it, much to their amusement of course (but does serve to illustrate to them the notion of incompletely-tested code).

Ideas? Suggestions? Awesome-ways-of-doing-things?

Re: Some feedback from the front lines

by Jenny Harlow - Monday, 7 November 2016, 5:53 PM

Hi Peter

That's really interesting, so I thought I'd add my 2 cents' worth of experience moving from human marking of code to CodeRunner. In summary: (1) I think that code-copying is often an issue on programming courses, but having the code submissions online makes it very much easier to spot and deal with; and (2) a short test (again, online) relatively early encourages students to see that it is better just to do the work for yourself.

We run a large (more than 600 students) first year university course teaching basic Matlab (and applications) to aspiring engineers. The students have to complete a 'lab' of tasks each week and in the original set-up all marking was done by tutors in a timetabled period in a computer lab. This means a lot of tutors and a lot of lab space and a lot of human marking... Students have tasks well ahead of the timetabled marking time and there is considerable scope for code-copying between students which is very hard for individual tutors to spot.

Moving to CodeRunner offers many benefits. We have done a mini (compressed in time, 20-40 students) version of the course twice with all labs on CodeRunner, but with smaller numbers a lot of the problems of either online or human marking go away naturally, so I'll focus on the big course. This year we had CodeRunner quizzes as part of the course assessment for the first time, replacing some but not all of the human marking of labs. Next year we could go to all-CodeRunner labs.

Students have responded very well to CodeRunner, but one of the other valuable things has been the opportunity to deal with copying -- which as I say always took place but was just hard to spot -- more effectively. Because all the submissions are online there is more data to use, and more time to do it in. Richard kindly passed on some of the 'cheat-check' code he uses and we run this on every set of lab submissions. We advertised this well ahead of time and word got around very quickly that we really meant it after the first sets of copies got pinged!

Obviously, we would much prefer that copying just did not happen, and we certainly don't want to get into an internet security situation of way more time being spent stopping the bad stuff than encouraging the good stuff. We also don't want students to get trapped into having to continue to copy just for survival if they somehow got away with it at the start (and the absolute basics are arguably harder to cheat check). At present we have a programming test relatively late in the semester, but with CodeRunner it will also be possible to add another short test very early to give the wakeup call where necessary.

On the other issue you mentioned - truly horrible code that gets the right answer - that again has always been an issue for us and in our context I am not sure that the human markers were able to deal with it particularly well either. Theoretically they could refuse marks until it was done better, but in practice it was more complicated. In our context of in-lab marking the time constraints mean that that ideal may in practice turn into saying it is truly horrible but giving marks anyway... It is certainly one of my concerns whatever marking system we use. For CodeRunner my question prototypes allow me to specify a limit on the number of code statements they can have in their submissions and I have found that helpful in discouraging round-the-houses code. I can also specify limits on the number of for-loops and if-statements that can be used. Maybe most things you can do to encourage nice code with a large introductory programming class are a bit of a blunt instrument. What I do like is the prospect that with CodeRunner our tutors can spend their time helping students who do want to do things better, rather than just marking flat-out.
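For anyone wanting to try the same kind of limit, here is a minimal sketch of the idea. It is not the Matlab prototype described above: it uses Python's ast module as a stand-in, and the function names and the shape of the limits dictionary are invented for illustration.

```python
import ast


def count_constructs(source):
    """Count statements, for-loops and if-statements in Python source."""
    counts = {"statements": 0, "for_loops": 0, "if_statements": 0}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            counts["statements"] += 1
        if isinstance(node, ast.For):
            counts["for_loops"] += 1
        if isinstance(node, ast.If):
            # Note: each 'elif' is a nested ast.If, so it counts as another if.
            counts["if_statements"] += 1
    return counts


def check_limits(source, limits):
    """Return one message per construct that exceeds its limit."""
    counts = count_constructs(source)
    return [
        f"too many {name}: {counts[name]} (limit {limit})"
        for name, limit in limits.items()
        if counts[name] > limit
    ]
```

A template could refuse to run the tests whenever check_limits returns a non-empty list, and show the messages to the student instead.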

Like you, I'd be really interested in hearing more thoughts on both copying and distinguishing nice from not-nice code.

Jenny

Re: Some feedback from the front lines

by Richard Lobb - Monday, 7 November 2016, 9:59 PM

Thanks for the interesting postings, Peter and Jenny. It's great to get feedback from the front lines.

I've been teaching programming for nearly 40 years, and copying of code has always been a problem. It's unpleasant to talk about it in class and even more unpleasant to deal with offenders. I've tried the strategy of telling the students they can cheat to their heart's content, because they'll fail the exam as a consequence, but it doesn't work. Students who do their own work get cross when they see other students simply copying code and getting the same or even more marks. The mood and motivation of the class deteriorates and a culture of code sharing develops. So like it or not you have to go to the extra trouble of detecting it and jumping on it. We always penalise both the copier and copy-ee. As Jenny says, after you've hit a few people with zero marks, word gets around very quickly.

Aside from checking for copying, other things that help to reduce the problem are:

Construct lab and assignment work so that students can do it one small step at a time. If the initial step is too large and daunting students will be more or less forced to get "help" from their friends.
Don't allocate too many marks for course work. If you make the stakes too high, the temptation will be too large.
Use labs and assignments for formative assessment and invigilated tests and exams for summative assessment. You should expect students to get high marks on formative assessment items - it's the motivation to get those high marks that drives the learning. Emphasise throughout that the point of the labs and assignments is just to give them the skills to pass the real assessment (tests/exams).

Jenny mentioned a "cheat checker" program. We have a range of these. The most sophisticated ones compute similarity measures between all pairs of submitted code for each of a number of assessment items and look for students or pairs of students whose code gets repeatedly tagged as suspiciously similar. However, these are a lot of effort to administer and my preference is to use a much simpler program I call dumbcopiers.py. This is used on just a single item of assessment. It checks for pairs of submissions that are:

Byte-for-byte identical. [These are the really dumb copiers]. Or
byte-for-byte identical after all comments have been stripped and all redundant white space has been removed. Or
byte-for-byte identical after all comments have been stripped, all redundant white space has been removed and all non-keyword tokens have been replaced with a generic identifier xxx.
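The three levels above can be sketched roughly as follows. This is not the attached dumbcopiers.py, just an illustration for Python submissions using the standard tokenize module; the names normalise and find_copies are made up, and it assumes "non-keyword tokens" means identifiers only (numbers and operators are left alone). Submissions that do not tokenize will raise an exception, which a real script would need to catch.

```python
import io
import keyword
import tokenize
from collections import defaultdict
from itertools import combinations


def normalise(source, level):
    """Level 0: raw text. Level 1: comments and redundant whitespace removed.
    Level 2: additionally, every non-keyword name becomes 'xxx'."""
    if level == 0:
        return source
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines carry no code
        if tok.type == tokenize.NEWLINE:
            out.append("\n")
        elif tok.type == tokenize.INDENT:
            out.append("<indent>")  # keep block structure, drop exact spacing
        elif tok.type == tokenize.DEDENT:
            out.append("<dedent>")
        elif (level == 2 and tok.type == tokenize.NAME
              and not keyword.iskeyword(tok.string)):
            out.append("xxx")
        else:
            out.append(tok.string)
    return " ".join(out)


def find_copies(submissions):
    """submissions maps student name -> code. Returns sorted
    (level, name_a, name_b) tuples, reporting the lowest level that matches."""
    found = {}
    for level in (0, 1, 2):
        groups = defaultdict(list)
        for name, code in submissions.items():
            groups[normalise(code, level)].append(name)
        for names in groups.values():
            for pair in combinations(sorted(names), 2):
                found.setdefault(pair, level)
    return sorted((level, a, b) for (a, b), level in found.items())
```

Grouping by the normalised text avoids comparing every pair directly, which keeps it quick even for a few hundred submissions.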

I attach the current version of dumbcopiers.py - it's configured only for C, Python and Matlab, so you'll have to tweak it for Java and for your own environment. To use it, first export all the responses for the quiz as an Excel or LibreOffice spreadsheet (Quiz Administration > Results > Responses > Download), open it with Excel or LibreOffice and save it as .csv with "quote all text fields". [Moodle provides a direct export as CSV but it doesn't work, or used not to work anyway, with cells containing arbitrary code.] Some fiddling around may be necessary to get a properly escaped CSV file containing students' names and their submitted code.

Once you've got that, you simply run dumbcopiers.py on it, after editing it to specify the file name, the question number you wish to check and the checking parameters.
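As a rough idea of what reading such a file involves, here is a small sketch using Python's csv module, which copes with quoted cells containing commas, newlines and doubled quotes. The column headings ("Surname", "First name", "Response N") are assumptions about the export and will probably need adjusting; load_responses is a hypothetical helper, not a function from dumbcopiers.py.

```python
import csv


def load_responses(csv_file, question_number,
                   name_columns=("Surname", "First name")):
    """Return a dict mapping student name -> submitted code for one question.

    csv_file is an open text file (open it with newline="" so that line
    breaks inside quoted code cells survive). Column names are assumed.
    """
    column = f"Response {question_number}"
    submissions = {}
    for row in csv.DictReader(csv_file):
        code = (row.get(column) or "").strip()
        if not code:
            continue  # student did not answer this question
        name = " ".join((row.get(c) or "") for c in name_columns).strip()
        submissions[name] = code
    return submissions
```

Typical use would be something like `with open("responses.csv", newline="", encoding="utf-8") as f: subs = load_responses(f, 3)`, giving a name-to-code mapping ready for pairwise comparison.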

On the subject of style: there will always be students who write rubbish code no matter what you do and there will also be the ones who genuinely care about their code and want to find nicer ways of doing things. For the ones in the middle you have to somehow communicate that you care about code quality, which means you have to put time and effort into it at some point. Having a code checker built into the CodeRunner question types makes a big difference. For Python we use pylint. It has improved code quality out of all recognition. But even a simple checker that can limit function size and line length is a big help. But at some point I think you do actually have to do human grading of style.
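To show how little is needed for the "simple checker" end of the scale, here is a sketch that flags over-long lines and over-long functions in a Python submission. It is only an illustration (it is not pylint and not our actual question type); the function name and the default limits are arbitrary choices, and it needs Python 3.8 or later for end_lineno.

```python
import ast


def style_problems(source, max_line_length=79, max_function_lines=20):
    """Return a list of messages about long lines and long functions."""
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_line_length:
            problems.append(
                f"line {number}: {len(line)} characters "
                f"(limit {max_line_length})")
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Length is counted from the 'def' line to the last body line.
            length = node.end_lineno - node.lineno + 1
            if length > max_function_lines:
                problems.append(
                    f"function {node.name}: {length} lines "
                    f"(limit {max_function_lines})")
    return problems
```

An empty list means the submission is within both limits; anything else can be shown to the student before any tests are run.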

Thanks again for an interesting read

Richard

dumbcopiers.py

Re: Some feedback from the front lines

by Peter Sander - Tuesday, 8 November 2016, 12:57 AM

dumbcopiers.py - nice!

With pre-CR student Java code submissions in Moodle I had a (Python) script to run their stuff through Sonar and return the results back to them for cleaning up. Then of course I'd get some still-truly-horrible-but-Sonar-compliant code. Oh well, it did serve to improve most of the submissions. FindBugs is a bit simpler to manage so I'm working on hooking that into CR. It doesn't do a cheat-check, so I think a Java version of dumbcopiers is called for.

Our problem was partly due to the all-or-nothing nature of the evaluation. Students with 6 passing tests out of 7 would be faced with a mark of 0 unless they got the remaining test to work, so with time running out there was a tendency to panic and to borrow a working solution. Given the short time period for the course I didn't really have much possibility to play with different ways of running the tests. You're quite right that this was a mix-up between for-learning and for-the-grade exercises. I'll be trying some different strategies on another course now running with a less hectic schedule.

Re: Some feedback from the front lines

by Richard Lobb - Tuesday, 8 November 2016, 8:25 PM

While getting feedback from the front lines ...

Patrick Wang reported longish response times on Java questions. I've never used CodeRunner in a Java course myself, so I have no feeling for how much of an issue the response time is. Do you have any comments on this, Peter, or anyone else?

One thing I (or someone else) could do, on the Jobe server, is to retain compiled programs in a cache, keyed on language plus the md5 checksum of the source. That would prevent recompilation of the same source. This would help in the case of "Write a program" questions in which the only thing changing is the input data. All test cases would then run with only a single compile. But ... is this a common situation and are the delays that users currently experience unacceptable, anyway? I generally subscribe to the "if it ain't broke, don't fix it" philosophy.
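Just to pin down what I mean by that cache, here is a sketch of the idea in Python. It is not Jobe code: the CompileCache class, the compile_fn callback and the on-disk layout are all invented for illustration, and a real version would also need eviction and locking.

```python
import hashlib
from pathlib import Path


class CompileCache:
    """Keep compiled artefacts on disk, keyed on language plus md5 of source."""

    def __init__(self, cache_dir, compile_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.compile_fn = compile_fn  # (language, source) -> compiled bytes
        self.compile_count = 0        # how many real compiles were needed

    @staticmethod
    def key(language, source):
        digest = hashlib.md5(source.encode("utf-8")).hexdigest()
        return f"{language}-{digest}"

    def get(self, language, source):
        """Return the path of the compiled artefact, compiling only on a miss."""
        artefact = self.cache_dir / self.key(language, source)
        if not artefact.exists():
            artefact.write_bytes(self.compile_fn(language, source))
            self.compile_count += 1
        return artefact
```

With this in place, running seven test cases against the same "Write a program" submission would call compile_fn once and hit the cache six times.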

Richard

Re: Some feedback from the front lines

by Peter Sander - Thursday, 17 November 2016, 12:23 AM

Anecdotally - I've not had any complaints yet from students about the time taken to execute their code. Lots of complaints about hidden test cases, but that's another story.

I'll post our server config and try to quantify performance...later.