Autogeneration of the expected answer?

by Heli Virtanen -

Number of replies: 3

Hi!

I am teaching mainly with Matlab (Octave should be close enough), and I want to create the "expected" answer by running a piece of teacher code, then run the student code and compare the two outputs (by other means). I can get both pieces of code (teacher and student) to run, but the problem is that all the output from the codes is automatically passed on to "got", and I can't figure out how to make one of them the "expected" and the other the "got". I thought of passing everything through Python and running the two pieces of code separately with subprocess, but then I ran into a problem with the execution command for Octave from within the prototype question.

Basically, my idea is to use Python to run both pieces of code and then grade using their outputs as a guideline (with my own grading template), following the documentation on "Supporting or implementing new languages" and multi-language questions, but I got stuck there. Any help would be greatly appreciated.
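
To make the idea concrete, here is roughly the kind of Python driver I have in mind. The file names ('teacher.m', 'student.m') and the Octave command line are only placeholders, not what the real question template would use:

import subprocess

def run_octave(filename, stdin_text=None):
    """Run one Octave script and return its standard output."""
    result = subprocess.run(
        ['octave', '--silent', '--no-gui', filename],
        input=stdin_text,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=4
    )
    return result.stdout

# 'teacher.m' and 'student.m' are placeholder file names that the real template
# would write out from the sample answer and the student submission.
expected = run_octave('teacher.m')
got = run_octave('student.m')
print('PASS' if got.rstrip() == expected.rstrip() else 'FAIL')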

In reply to Heli Virtanen

Re: Autogeneration of the expected answer?

by Richard Lobb -

Certainly you can use the sample answer to generate the expected output. We have an option for this in our standard Python3 question type: we use the given test case expected value if it's non-empty, otherwise we run the answer. It's good to have at least one test case where you manually provide the expected field: (a) to check your sample answer and (b) to provide a "for example" table.

It sounds like you're doing the right sort of thing, so I'm not sure what's not working. You need to use either a per-test-case template grader or a combinator template grader. The latter is strongly recommended for production use because:

  1. It's much more efficient (involving just a single Jobe run for all tests if "Allow multiple stdins" is checked, rather than one run for each test case) and
  2. It's much more flexible, allowing any form of feedback.
However, a per-test-case template grader is much easier to write, so I recommend that as a first step. For comparison, a rough sketch of the single JSON record a combinator grader has to print is shown just below.
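
For what it's worth, a combinator template grader has to print one JSON record covering all the tests rather than one per test. The sketch below shows roughly the shape of that record; the field names ('fraction', 'testresults') and the special 'iscorrect' header follow my reading of the template-grader documentation, so treat them as assumptions and check the docs for the details.

import json

# Hypothetical per-test results, as they might be collected by looping over the
# {{ TESTCASES }} list in a combinator template (the values here are placeholders).
results = [
    {'testcode': 'disp(sqr(2))', 'expected': '4', 'got': '4'},
    {'testcode': 'disp(sqr(3))', 'expected': '9', 'got': '8'},
]

# Build the result table. The first row gives the column headers; 'iscorrect'
# is a special header that (per the docs) CodeRunner renders as a tick or cross.
table = [['iscorrect', 'Test', 'Expected', 'Got']]
num_right = 0
for r in results:
    correct = r['got'].rstrip() == r['expected'].rstrip()
    num_right += correct
    table.append([correct, r['testcode'], r['expected'], r['got']])

# One JSON record for the whole submission.
print(json.dumps({
    'fraction': num_right / len(results),
    'testresults': table
}))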

Here's the template for an Octave question type that uses a per-test-case template grader and fills in the expected output from the answer if none is given. The program to be tested, which is assumed to be a script rather than a function, consists of "clear;", the student's answer, the test case and the 'extra' field of the test. This question type has had negligible testing - I wrote it only as a proof of concept. It requires a reasonably modern Python (at least 3.7, since it uses the capture_output and text arguments of subprocess.run).

""" A largely untested proof of concept question type to show how an
   the test case expected field can be generated by running the sample answer.
   This is a per-test-case template grader Q - inefficient compared to a combinator
   template grader but easier to write.
"""

import subprocess, sys, json

def run_code(code, stdin=None):
    """Runs the given Octave code as a standalone program. The return value is
       a tuple (stdout, stderr) where stdout is the standard output from the
       program and stderr is the error output (and/or a message indicating what
       error return code was received)."""
    with open('prog.m', 'w', encoding='utf8') as source:
        source.write(code)

    flags = '--silent --no-gui --no-history --no-window-system --norc'
    command = 'octave {} prog.m'.format(flags).split()

    try:
        result = subprocess.run(
            command,
            capture_output=True,
            input=stdin,
            timeout=4, # Should be less than the Jobe timeout
            text=True
        )
    except subprocess.TimeoutExpired:
        stdout = ''
        stderr = 'Timeout expired'
    else:
        stdout = result.stdout
        stderr = result.stderr
        if result.returncode < 0:
            stderr += f"\nTask failed with signal {-result.returncode}"
    return (stdout, stderr)


# Run job
def do_testing():
    """Run the current test case. If the expected output is empty, run the
       sample answer code first.
    """
    error = ''
    stdin = """{{ TEST.stdin | e('py') }}"""
    test_expected = """{{ TEST.expected | e('py') }}""".rstrip()
    if test_expected == '':
        # No expected value supplied, so generate it by running the sample answer.
        answer = """{{ QUESTION.answer | e('py') }}"""
        expected, error = run_code(answer, stdin)
        if error:
            outcome = {'expected': '', 'got': f"Error in question: {error}", 'fraction': 0}
        else:
            test_expected = expected

    if not error:
        student_answer = """{{ STUDENT_ANSWER | e('py') }}"""
        test_code = """{{ TEST.testcode | e('py') }}"""
        extra = """{{ TEST.extra | e('py') }}"""
        # The program to run: clear, the student's answer, the test code, then 'extra'.
        code_to_run = f'clear;\n{student_answer}\n{test_code}\n{extra}\n'
        output, error = run_code(code_to_run, stdin)
        if error:
            if output:
                output += '\n' + error
            else:
                output = error
        fraction = 1 if output.rstrip() == test_expected.rstrip() else 0
        outcome = {'expected': test_expected, 'got': output, 'fraction': fraction}
        
    print(json.dumps(outcome))
    
do_testing()

I attach an export of the question type and the one and only test question I've tried with it.

In reply to Heli Virtanen

Re: Autogeneration of the expected answer?

by Heli Virtanen -

Thank you for the quick reply, that was really helpful.

Unfortunately, our server is currently running Python 3.6.9, so your samples didn't run straight away. I tried to adapt them to that Python version and I think I got the changes right, but the code run for the second (and subsequent) test cases now fails to read the input, even though the first one does read it, and the expected answer is filled in correctly based on the exact same code as in the model solution's test.
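
For reference, as far as I can tell the only Python 3.7-specific parts are the capture_output=True and text=True arguments to subprocess.run, so on 3.6 I replaced that call with something along these lines:

result = subprocess.run(
    command,
    stdout=subprocess.PIPE,      # capture_output=True needs Python 3.7
    stderr=subprocess.PIPE,
    input=stdin,
    timeout=4,                   # should stay below the Jobe timeout
    universal_newlines=True      # the 3.6 equivalent of text=True
)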

So I'm not exactly sure what went wrong with the changes I made. I've attached the modified code from the question rather than the prototype, as I edited the question first and will update the prototype once I get it working.