Using LLMs for qualitative feedback on student submissions?

by Gregory Seront -
Number of replies: 14

I’m wondering if there are any ways to integrate LLMs (e.g. ChatGPT) with CodeRunner in order to provide qualitative feedback on students’ code submissions — in addition to the usual functional tests.

For example, I’d like to give students insights into their code style, structure, naming, or other aspects that go beyond correctness. Is there any existing functionality in CodeRunner to support this kind of integration, or has anyone experimented with such an approach?


Thanks!


In reply to Gregory Seront

Re: Using LLMs for qualitative feedback on student submissions?

by Henry Hickman -

This should be possible by interfacing with most LLM APIs. You can customize a question to call the API and put the feedback either into the results table programmatically, or into the epiloguehtml within the question.

Here's the OpenAI link for API integration: https://platform.openai.com/docs/overview?lang=python
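Purely as an untested sketch, something like the following combinator template grader ought to work. The STUDENT_ANSWER expansion and the "fraction" / "epiloguehtml" fields in the printed JSON are standard CodeRunner; the model name, the prompt and the key handling are only placeholders:

import html
import json
import os

# Standard CodeRunner Twig expansion of the student's submission.
student_code = """{{ STUDENT_ANSWER | e('py') }}"""

# Run the usual functional tests here (omitted) and compute the mark.
fraction = 1.0  # placeholder: replace with the real test outcome

feedback = ""
try:
    from openai import OpenAI  # assumes the openai package is installed on the Jobe server
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))  # keeping this secret is the open question
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Comment briefly on the style, structure and naming of the "
                        "student's code. Do not supply corrected code or the solution."},
            {"role": "user", "content": student_code},
        ],
        max_tokens=300,
    )
    feedback = response.choices[0].message.content
except Exception as err:
    feedback = "AI feedback unavailable: " + str(err)

# A combinator template grader reports its result as a single JSON object;
# epiloguehtml is shown to the student below the result table.
print(json.dumps({
    "fraction": fraction,
    "epiloguehtml": "<h4>Style feedback</h4><pre>" + html.escape(feedback) + "</pre>",
}))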

The only thing I'm not positive about is ensuring your API key remains secret...

Sadly, this is all theoretical to me; I've never integrated AI into CodeRunner questions before, but I don't see any reason why it wouldn't be possible!


In reply to Gregory Seront

Re: Using LLMs for qualitative feedback on student submissions?

by Mike McDowell -
You could have a look at what Marcus Green's been doing with text and AI. I think it would be great to have AI generate feedback for tests (given strict prompts to avoid answer hinting, etc.) or overall question feedback.

https://github.com/marcusgreen/moodle-qtype_aitext
In reply to Mike McDowell

Re: Using LLMs for qualitative feedback on student submissions?

by marcus green -
I have just returned to look at CodeRunner (last logged in in 2017) and I noticed the reference to my work with LLMs and the AIText question type. I have also done some slight experimentation with a concept I call "Chinsert", meaning chat insert. The idea is a plugin designed to be used by other plugins, or via callbacks, to insert prompt and response fields in any form. It is very much a high-level concept at the moment, but feel free to respond/contact me if anyone is interested. AI Text is getting some good feedback and is being used quite widely.
In reply to marcus green

Re: Using LLMs for qualitative feedback on student submissions?

by marcus green -
I have done additional research on using AI/LLM systems to evaluate student code submissions. Some of the leading providers' models (OpenAI and Gemini) do surprisingly badly at this. While they can create correct, runnable code, when asked to evaluate code they seem to miss things that are fairly obvious to the human eye, such as unbalanced braces and parentheses. However, I have found the Qwen Coder models to be quite good at this, even the smaller ones. I will continue my research into this.
In reply to marcus green

Re: Using LLMs for qualitative feedback on student submissions?

by Mike McDowell -
I've noticed (to my delight) that the Qwen models review code fairly well given their small size and speed locally. I'm excited you're here and taking a look at this.

I teach high school kids and think some instant feedback on their code (either as an additional column during the tests, or post-submission) would help students quite a bit when they're working on assignments from home, or when the quiz/test ends at the end of class and I simply can't get to all of them. Immediate feedback is something they all report as really helpful (I've been experimenting with various tests that look for exceptions, specific output, or specific lines of code being present, and including this in the test result column).

I host my own Moodle server for this (homelab) so have full admin rights and it's only myself and 2 others using my service. I'd be happy to test and implement in a real world setting with CodeRunner if you're tinkering.
In reply to Mike McDowell

Re: Using LLMs for qualitative feedback on student submissions?

by marcus green -
Hi Mike, what school do you teach at?
I have just spent US$5 on some tokens from Together AI, who provide access to some of the big Qwen models, and it is very impressive. I am also running it on my home server (hardware cost was £640).
In reply to marcus green

Re: Using LLMs for qualitative feedback on student submissions?

by Mike McDowell -
Teaching at a high school here in Calgary, AB, Canada. Still on summer vacation here so I haven't really jumped back into any planning and prep yet.

Please let me know what you learn from the various models and their feedback with regard to coding! I've been rocking Open WebUI and Ollama here locally but am limited to 7B models. I've opened an account with Together AI (hadn't heard of it before) and am thinking I might trial some Qwen feedback on a few student assignments. Overall I use CodeRunner for assignments (unlimited attempts) then assess with quizzes and exams using SEB. I think the feedback may help when (if) they are working on some assignments at home and get lost.

For a workflow without modding any CodeRunner files, I think that after CR runs the tests in the template, I'll query together.ai with the student code, failed tests, and a prompt (with a token limit), then display the response as part of the test run output and see how that goes. As for which models to use, Qwen3-235B looks affordable and seems to work well. Since I'm talking maybe 60-70 students in the intro class, I'll give it a go and see what they have to say.
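Roughly what I have in mind for the query step, as an untested sketch (the endpoint is Together AI's OpenAI-compatible chat API; the model string, prompt wording and token limit are just guesses for now). The returned text would then get appended to the test run output:

import os
import requests

def ai_feedback(student_code, failed_tests):
    prompt = (
        "A student's code failed some tests. Respond with guiding questions that "
        "point them toward the problems. Do not write or reveal corrected code.\n\n"
        "Code:\n" + student_code + "\n\nFailed tests:\n" + failed_tests
    )
    resp = requests.post(
        "https://api.together.xyz/v1/chat/completions",
        headers={"Authorization": "Bearer " + os.environ["TOGETHER_API_KEY"]},
        json={
            "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8",  # or whichever Qwen model
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 250,  # keep the token (and cost) budget small
        },
        timeout=20,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]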
In reply to Mike McDowell

Re: Using LLMs for qualitative feedback on student submissions?

by marcus green -
Hi Mike
I have set up a course that demonstrates three of the CodeRunner example questions, only translated to use my AI Text question type.

You can get instant access to it here

https://www.flossed.uk/mdl50/

It uses the Qwen LLM and you should be able to make multiple attempts.
In reply to marcus green

Re: Using LLMs for qualitative feedback on student submissions?

by Richard Lobb -
This is very interesting Marcus. Which version of Qwen is that quiz using, and is it running on your home server or in the cloud?
In reply to Richard Lobb

Re: Using LLMs for qualitative feedback on student submissions?

by marcus green -
Hi Richard, the inference is from the cloud with Together.ai, and all my testing over the last week or so with it has cost me about 6 US cents so far. The model is a big one, specifically

Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8

I have a local "AI Tower" machine in my spare room with the following spec

Processor: AMD Ryzen 5 5600
Memory: Corsair 32GB (2x16GB) DDR4 3200MHz
Graphics: Palit RTX 3060 12GB
I have configured Ollama and downloaded several models. It will load and run qwen:32b.
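For anyone wanting to poke at the same thing locally, the request is much the same shape against Ollama's own API. This is only a rough sketch; the model tag, prompt and timeout are placeholders:

import requests

student_code = "def add(a, b):\n    return a+b"  # placeholder submission

resp = requests.post(
    "http://localhost:11434/api/chat",   # Ollama's local chat endpoint
    json={
        "model": "qwen2.5-coder:32b",    # whichever local tag you have pulled
        "messages": [{"role": "user",
                      "content": "Comment on the style of this code only:\n" + student_code}],
        "stream": False,                 # return a single JSON reply rather than a stream
    },
    timeout=300,                         # a 32b model on modest hardware is slow
)
print(resp.json()["message"]["content"])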

I can give you teacher access and a playground course on the site at www.flossed.uk if you are interested, and also run some of the full prompts against my local machine to check the accuracy (it will certainly be slow, of course)...

Best wishes from (old) York.
You can also contact me directly at marcusavgreen at gmail.com
In reply to marcus green

Re: Using LLMs for qualitative feedback on student submissions?

by Richard Lobb -
Thanks Marcus. It's certainly an interesting project. We have run a couple of student projects involving hooking CodeRunner into an AI to provide help for students, but we've only used the big commercial AIs (ChatGPT and Claude). I am mainly interested to know how the open-source models measure up, and I've heard very positive reviews of Qwen3.

Your focus is a bit different from ours. You seem to be using the AIs to assess a student's plain-text submission, whereas assessment for us (using CodeRunner to teach coding) is primarily about correctness of code. Correctness is much more reliably assessed by running the code.

We can potentially use AIs to assess style but we haven't been impressed by the quality of style assessment even by the best AIs. However, style assessment is very subjective and humans vary dramatically in their style assessments, too.

The other main use of AIs is to help students develop code but, most importantly, without actually writing the code for them. For that, I'm a great fan of students engaging with the AI in an ongoing chat session rather than single-button "here's what you should do" type help.

It's an exciting space to be in, eh?!

Richard
In reply to Richard Lobb

Re: Using LLMs for qualitative feedback on student submissions?

by marcus green -
In my testing Qwen3 seems more suitable for programming related issues than any other model, with Claude coming second.

There is no substitute for running code through its native environment/compiler/interpreter. People should never forget that AI/LLM systems are essentially very elaborate auto-complete systems with no "intelligence" at all. That doesn't mean they are not useful, just that people should understand their limitations. However, the word "should" there is very loaded, as I know people will forget that.

Tim Hunt (co-maintainer of CodeRunner, for readers other than Richard ....) has been talking in the forums, in the lead-up to the MoodleDach in a couple of weeks, about AI and the question creation phase. I think this would mean a dialogue within the question editing interface, a concept I have played with a little. My apologies to Tim if I have mangled his concept.

It is a very exciting area and I feel that people committed to free software should engage both to bring benefits and to keep a sense of reality/perspective on these tools.

Now back to my CodeRunner related work today ....
In reply to Richard Lobb

Re: Using LLMs for qualitative feedback on student submissions?

by Mike McDowell -
As a proof of concept I've been experimenting with template grading: querying the LLM at the conclusion of testing and including the response as part of the 'epiloguehtml'. I've found the results using Together AI and the Qwen3-Coder-480B-A35B-Instruct-FP8 model have been pretty good. That said, it's not exactly cheap either.

I have mixed results using our local Ollama instance, primarily as we're restricted to Qwen2.5-Coder models of 7B or less. That said, with the server on the same subnet as jobe it's isolated from the web and quite fast.

Pros:
- A specific prompt can steer the model away from giving the answer and toward guiding questions for the student
- Failed test cases can be included along with the code
- Limiting feedback to "Check" naturally restricts the number and rate of calls

Cons:
- Scaling issues with a large number of students
- As it's run via jobe, a queue can result in timeouts (one mitigation is sketched below the list)
- Formatting and terminology are inconsistent, especially for new coders
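
On the timeout point, the mitigation I'm considering is to fail soft: bound the LLM call by whatever time is left in the jobe run and fall back to the plain test results if the reply doesn't arrive. A rough sketch, where query_llm stands in for whichever API call is actually in use (a hypothetical helper):

import requests

def feedback_or_fallback(query_llm, seconds_left):
    # query_llm is a hypothetical callable that accepts a timeout in seconds.
    try:
        # Leave some headroom so the grader itself never hits the jobe limit.
        return query_llm(timeout=max(1, seconds_left - 5))
    except (requests.RequestException, KeyError, ValueError):
        return "AI feedback was skipped this run; see the test results above."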

It'd be interesting to have a Moodle block that could tie together the LLM chat and the results from the latest jobe run in one space specific to the question.
In reply to Mike McDowell

Re: Using LLMs for qualitative feedback on student submissions?

by marcus green -
My experiments with the Qwen3-Coder-480B-A35B-Instruct-FP8 running on Together AI have cost me about one US cent a day over the last two weeks. I know that is a crude measurement, but it does give some idea.

I have managed to get the 32b Qwen running on my home server, but inevitably it is very slow.