CodeRunner: JobeInABox + Kubernetes?

Hello all,

We have been running CodeRunner at Aalto University since 2016, and it seems that we discover new ways to use it every year (thank you so much Richard and everybody else!). For example, we use it to check students' Excel spreadsheets in our chemistry and materials science courses.

As the number of CodeRunner courses and students is increasing steadily, we have tried to increase the scalability of Jobe by moving from one single Jobe server to a cloud-based JobeInABox + Kubernetes setup. Things are mostly working really well, but we have some issues related to Python3 exercises with support or attachment files.

Here is a schematic figure of the setup developed by our IT department:

[Figure: Jobe-Kubernetes]

The JobeInABox + Kubernetes setup has been in production use for over a year. It runs smoothly for questions that do not include any support or attachment files. However, questions with support or attachment files behave erratically. Typically, when we create a new question, the first few runs produce an "Unexpected error":

[Screenshot: Error]

When we try again after a while (from a few minutes to a few hours), the tests run fine. After this, all runs complete successfully for students as well. However, the same error may recur later on, somewhat randomly. And we do have some pathological cases that always return the same error (these have as many as five support files).

We have read the previous very useful discussions on Jobe load balancing back in 2016. There Richard mentioned that while individual tests can run on different Jobe servers, one single test should run inside one Jobe server. I wonder if this is still true today? In our Kubernetes setup, all Jobe containers share the same file cache directory (/home/jobe/files). But clearly we might be missing some crucial detail here, as we get these errors that we did not see with a single Jobe server. Debugging the issue has been tough, as the error usually disappears after a few retries.

If anyone has any ideas or advice related to a JobeInABox+Kubernetes setup, we would be very interested in hearing your thoughts!

Best wishes,
Antti Karttunen

Re: JobeInABox + Kubernetes?

by Richard Lobb - Saturday, 19 February 2022, 3:21 PM

Good to know you're finding CodeRunner useful and are discovering new uses for it. We're doing that too :)

It sounds like you probably understand how the file cache works with CodeRunner, but in case you don't ...

CodeRunner defines the ID of a file as the MD5 hash of its contents. To run a job with files, CodeRunner computes the IDs of the files it needs, plus the names those files should have when copied to the working directory. It then sends the job off without the files. If it gets a 404 response, meaning one or more files weren't found in the file cache, it does a series of HTTP PUT requests to upload the files and then repeats the original request. If it then gets a 404 response on this second submission, it generates an error like the one you have above. [The assumption underpinning this design is that most files are those supplied by question authors, used over and over by each student submission.]
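
To make the sequence above concrete, here is a minimal Python sketch of it. This is not the actual CodeRunner code (which is PHP and talks to Jobe over HTTP); FakeJobe, RoundRobinCluster and submit are hypothetical names, and the "servers" are plain in-memory dictionaries. The round-robin cluster with unshared caches is only an assumption used to reproduce the symptom, not a claim about how any real setup behaves.

```python
import hashlib


def file_id(contents: bytes) -> str:
    """ID of a file = MD5 hash of its contents (as described above)."""
    return hashlib.md5(contents).hexdigest()


class FakeJobe:
    """Hypothetical in-memory stand-in for one Jobe server and its file cache."""

    def __init__(self):
        self.cache = {}

    def put_file(self, fid, contents):
        self.cache[fid] = contents

    def run(self, source, files):
        """files is a list of (file_id, name) pairs; 404 if any ID is not cached."""
        if any(fid not in self.cache for fid, _ in files):
            return 404
        return 200


class RoundRobinCluster:
    """Hypothetical load balancer over replicas whose caches are NOT shared."""

    def __init__(self, n):
        self.replicas = [FakeJobe() for _ in range(n)]
        self.counter = 0

    def _next(self):
        replica = self.replicas[self.counter % len(self.replicas)]
        self.counter += 1
        return replica

    def put_file(self, fid, contents):
        self._next().put_file(fid, contents)

    def run(self, source, files):
        return self._next().run(source, files)


def submit(server, source, support_files):
    """Send the job without files; on 404 upload the files and retry once."""
    files = [(file_id(data), name) for name, data in support_files.items()]
    status = server.run(source, files)
    if status == 404:
        for name, data in support_files.items():
            server.put_file(file_id(data), data)
        status = server.run(source, files)
        if status == 404:
            raise RuntimeError("files still missing after upload")
    return status
```

With a single FakeJobe the first submission uploads the file and the retry succeeds. With RoundRobinCluster(2) and one support file, the upload lands on one replica and the retry on the other, so the second 404 triggers the error.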

So, indeed the symptoms you describe are exactly what I'd expect if the /home/jobe/files directory were not being shared. I have no experience of sharing a directory across multiple containers, but is it possible that the containers themselves are caching the shared directory in their own memory, perhaps even very briefly? With a write-back cache, say, the PUT request might return to the Moodle server before the shared target directory has been updated, so that when the job request comes through on a different machine, the file isn't there yet.

I realise that this shouldn't happen if a file system is being properly shared, but it's the only explanation I can think of, I'm sorry. If that turns out not to be the explanation, I can only suggest that you find a way to direct all HTTP requests associated with a given Jobe request to the same container, using the X-CodeRunner-Job-Id request header, added by Tim Hunt for use at the Open University. Here's his comment in the code relating to that:

    /**
     * @var string when this variable is set, it is added as a HTTP
     * header X-CodeRunner-Job-Id: to every API call we make.
     *
     * This is intended for use when load-balancing over multiple instances
     * of JOBE, so that a sequence of related API calls can all be
     * routed to the same JOBE instance. Typically a particular value
     * of the job id will not be used for more than a few seconds,
     * so quite a short time-out can be used.
     *
     * Typical load-balancer config might be:
     *  - with haproxy try "balance hdr(X-CodeRunner-Job-Id)" (not tested)
     *  - with a Netscaler use rule-based persistence with expression
     *    HTTP.REQ.HEADER("X-CodeRunner-Job-Id")
     */
    private $currentjobid = null;
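
As a rough illustration of what header-based routing amounts to, here is a small Python sketch that maps the X-CodeRunner-Job-Id header to one of several backends by hashing it. The function name pick_backend and the backend names are hypothetical, and a real load balancer (haproxy, Netscaler, a Kubernetes ingress) would be configured with its own rules rather than code like this.

```python
import hashlib
import random


def pick_backend(headers, backends):
    """Choose a backend so that equal job IDs always map to the same one.

    headers  -- dict of HTTP request headers (names compared case-insensitively)
    backends -- non-empty list of backend names (hypothetical, e.g. pod names)
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    job_id = lowered.get("x-coderunner-job-id")
    if job_id is None:
        # No job ID on the request: any backend will do.
        return random.choice(backends)
    # A stable hash of the header value; Python's built-in hash() is salted
    # per process, so it would not be consistent across balancer instances.
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

All requests carrying the same job ID (the first run attempt, the file uploads, and the retried run) then reach the same backend, as long as the list of backends does not change in between.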

An alternative that was added to CodeRunner recently (thanks Khang Pham Nguyen) is to set the Jobe server field in the CodeRunner settings to a comma-separated list of Jobe servers. One of those is selected randomly and the $currentjobid is used to ensure the same server is used for all requests related to a particular job.
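
The idea can be sketched in Python as follows. This is only one plausible way to get the behaviour described, not the CodeRunner implementation: the function names, the example host names, and the choice of a random hexadecimal job ID reduced modulo the number of servers are all assumptions for illustration.

```python
import random


def parse_server_list(setting):
    """Split a comma-separated 'Jobe server' setting into host names."""
    return [part.strip() for part in setting.split(",") if part.strip()]


def new_job_id(rng=random):
    """Hypothetical job ID: a random 32-bit value as 8 hex digits."""
    return format(rng.getrandbits(32), "08x")


def server_for_job(job_id, servers):
    """Map a job ID to one server; the same ID always gives the same server."""
    return servers[int(job_id, 16) % len(servers)]


servers = parse_server_list("jobe1.example.org, jobe2.example.org")
job_id = new_job_id()
# Every request belonging to this job (run, file uploads, retry) uses this host.
chosen = server_for_job(job_id, servers)
```

Because the job ID is random, the server choice is effectively random per job, while all requests within one job stay on a single server.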

Re: JobeInABox + Kubernetes?

by Antti Karttunen - Sunday, 20 February 2022, 5:37 AM

Dear Richard,

Thank you so much for all the help and ideas! I fully agree that in-memory caching within the containers could very likely result in the issues we are encountering. Two weeks ago, we tried reducing the number of containers to one to see if this would solve our problems, but the same issue persisted. We will now follow your advice and make our load balancer aware of the X-CodeRunner-Job-Id header, to make sure that all steps in a test run are executed within the same container.

I will report our findings back to this thread. It was also great to hear that in the latest CodeRunner version there is the possibility to have simple load-balancing/redundancy just by listing several Jobe servers in the settings!

Best wishes,
Antti