CodeRunner: Load-balancing the JOBE sandbox

Has anyone else tried running multiple instances of the JOBE sandbox behind a load-balancer?

We would like to do this for extra resiliency.

If I understand correctly, processing one student submission can involve several calls from CodeRunner to Jobe (e.g. some put_files, then a runs), and these will all need to go to the same JOBE server. Am I right about that?

If so, I think it is doable. For example, our load-balancer can send a sequence of requests to the same server if they all carry the same cookie. So I am wondering about doing something like:

In the qtype_coderunner_jobesandbox constructor, set $this->jobid = rand();
For each HTTP request, add either a cookie or a custom header, with a name like X-CodeRunner-Job-Id, set to that value.
Configure our load-balancer to use that when routing requests.
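
The three steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than CodeRunner's PHP. The backend names are made up, and `pick_backend` is only a toy stand-in for whatever stickiness rule a real load-balancer applies. The one detail taken from the post is the header name X-CodeRunner-Job-Id.

```python
import hashlib
import secrets

JOB_HEADER = "X-CodeRunner-Job-Id"  # header name proposed in the post
BACKENDS = ["jobe1.example", "jobe2.example", "jobe3.example"]  # hypothetical servers


def new_job_id() -> str:
    """Return a random id shared by all requests that belong to one job."""
    return secrets.token_hex(8)


def job_headers(job_id: str) -> dict:
    """Headers to attach to every HTTP request of the job."""
    return {JOB_HEADER: job_id, "Content-Type": "application/json; charset=utf-8"}


def pick_backend(headers: dict, backends=BACKENDS) -> str:
    """Toy load-balancer rule: hash the job id to choose a server."""
    digest = hashlib.sha256(headers[JOB_HEADER].encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]


job_id = new_job_id()
# Three requests from the same job (say, two file uploads and a run) ...
routes = [pick_backend(job_headers(job_id)) for _ in range(3)]
# ... are all routed to the same backend.
same_server = len(set(routes)) == 1
```

Note that plain hash-modulo routing remaps jobs whenever the list of backends changes, so a real setup would more likely rely on the load-balancer's own persistence feature.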

Any comments on that?

Re: Load-balancing the JOBE sandbox

by Richard Lobb - Wednesday, 30 November 2016, 22:50

I've never tried it myself, but I like the idea.

The sequence of calls from CodeRunner to Jobe is roughly as you describe it. Note though that support files are cached on the Jobe server indefinitely, so running a test doesn't normally involve any file uploads except on the very first test case run of a particular question.

In a worst-case scenario, multiple test cases need to be run for each question (no combinator) and multiple support files are needed. Each test case is run independently as follows:

1. Compute the md5 checksums of all support files - these are used as file ids.
2. POST the test case to Jobe, containing the list of required file ids.
3. If you get a 200 OK response, you're done. Otherwise ...
4. If you get a 404 File Not Found response, one or more of the required support files is missing, so:
   - PUT each file.
   - POST the test case to Jobe (again with a list of required file ids).
5. If you get anything other than 200 OK, you're dead.
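
The sequence above can be sketched as a small retry routine. This is an illustration only: `FakeJobe` is an in-memory stand-in invented for the example, not the real Jobe REST API, and the `run_spec` field names and the 204 upload status are assumptions made for the sketch.

```python
import hashlib


class FakeJobe:
    """In-memory stand-in for one Jobe server (hypothetical, no HTTP)."""

    def __init__(self):
        self.files = {}   # file id -> contents; models the server's file cache
        self.uploads = 0  # number of PUTs received, for illustration

    def put_file(self, file_id, contents):
        self.files[file_id] = contents
        self.uploads += 1
        return 204

    def post_run(self, run_spec):
        missing = [fid for fid, _name in run_spec["file_list"]
                   if fid not in self.files]
        return 404 if missing else 200


def file_id(contents: bytes) -> str:
    """md5 checksum of the contents, used as the file id."""
    return hashlib.md5(contents).hexdigest()


def run_test_case(server, source, support_files):
    """Run one test case; support_files maps filename -> bytes."""
    ids = {name: file_id(data) for name, data in support_files.items()}
    run_spec = {"sourcecode": source,
                "file_list": [[ids[name], name] for name in support_files]}
    status = server.post_run(run_spec)
    if status == 404:  # at least one support file is not cached yet
        for name, data in support_files.items():
            server.put_file(ids[name], data)
        status = server.post_run(run_spec)
    if status != 200:
        raise RuntimeError(f"Jobe run failed with HTTP status {status}")
    return status


server = FakeJobe()
files = {"data.txt": b"1 2 3\n"}
run_test_case(server, "print(open('data.txt').read())", files)  # uploads, then runs
run_test_case(server, "print('again')", files)                  # cache hit, no upload
```

If the PUT requests went to one server and the second POST to another, the retry would hit a cold cache and fail, which is why the whole sequence has to stay on a single server.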

Since all test cases are independent, you don't actually need to run them on the same Jobe server although you do need to ensure that the entire sequence above uses just the one server. Running different test cases on different servers wouldn't currently give you a performance benefit because the test case submissions are synchronous. However, if you're going to have multiple Jobe servers available, it might be interesting to think about making the test case runs asynchronous so they could be run in parallel then re-sequenced as results came back.
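
As a rough idea of what "run in parallel then re-sequenced" could look like, here is a minimal sketch using a Python thread pool. It is not how CodeRunner works today. `run_on_jobe` is a hypothetical placeholder for one complete synchronous test case run, and the delays merely simulate servers answering at different speeds.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_on_jobe(test_case):
    """Hypothetical stand-in for one synchronous test case run."""
    index, delay = test_case
    time.sleep(delay)  # simulate servers answering at different speeds
    return f"result {index}"


def run_all(test_cases, max_workers=4):
    """Run test cases concurrently; results come back in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs, whichever
        # run finishes first, so no explicit re-sequencing step is needed.
        return list(pool.map(run_on_jobe, test_cases))


results = run_all([(0, 0.03), (1, 0.01), (2, 0.02)])
```

Each call to `run_on_jobe` would still need its own job id so that its POST/PUT/POST sequence stays on one server.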

The steps you describe for load balancing should work fine, but it might be better to put the line $this->jobid = rand() into the jobesandbox::execute() method instead of the constructor. That keeps alive the possibility of exploiting parallel test case execution if any of us ever gets around to making the submissions asynchronous.
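
To make the difference concrete, here is a small sketch (in Python, with invented class and method names, not the actual jobesandbox code) in which the id is generated inside `execute()`. Every request within one run shares an id, while separate runs get separate ids and could therefore be routed to different servers.

```python
import secrets


class JobeSandboxSketch:
    """Hypothetical outline; only the placement of the job id matters here."""

    def __init__(self):
        self.sent = []  # (job id, request description) pairs, for illustration

    def _request(self, job_id, description):
        # A real client would send job_id as the X-CodeRunner-Job-Id header.
        self.sent.append((job_id, description))

    def execute(self, source):
        job_id = secrets.token_hex(8)  # fresh id for each test case run
        self._request(job_id, "POST run")
        self._request(job_id, "PUT file")
        self._request(job_id, "POST run (retry)")
        return job_id


sandbox = JobeSandboxSketch()
first = sandbox.execute("print(1)")
second = sandbox.execute("print(2)")
```

Had the id been set in the constructor instead, both runs above would carry the same id and always land on the same server.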

Of course, since I've never tried this, I guarantee nothing. But I look forward to hearing how it goes.

Richard

PS: I forgot to mention that the process of running all the test cases for a question actually starts with a single GET request to find what languages the Jobe server supports. That doesn't change the load balancing game, although all Jobe servers in your cluster would have to support the same set of languages.
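
One way to check that requirement before putting servers into the pool is to compare their language lists. The sketch below assumes each server's languages response has already been fetched and decoded into a list of [language, version] pairs. That shape, the server names, and the versions are assumptions made for illustration.

```python
def language_mismatches(languages_by_server):
    """Map each server to the languages it lacks relative to the whole cluster.

    languages_by_server maps a server name to a list of [language, version]
    pairs (the response shape assumed for this sketch).
    """
    names = {server: {lang for lang, _version in pairs}
             for server, pairs in languages_by_server.items()}
    all_languages = set().union(*names.values())
    return {server: sorted(all_languages - langs)
            for server, langs in names.items() if all_languages - langs}


cluster = {  # hypothetical responses from two servers
    "jobe1.example": [["python3", "3.5.2"], ["c", "5.4.0"], ["java", "1.8.0"]],
    "jobe2.example": [["python3", "3.5.2"], ["c", "5.4.0"]],
}
missing = language_mismatches(cluster)
```

An empty result means every server offers the same set of languages. Here `missing` reports that jobe2.example lacks java. Versions are ignored in this sketch, although differing versions could matter too.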

Re: Load-balancing the JOBE sandbox

by Tim Hunt - Thursday, 1 December 2016, 07:16

I think our interest in doing this is motivated by

resilience (one server crashing does not stop students working).
scalability (more students being able to attempt questions at the same time).
maintenance (e.g. when you need to patch Linux, you can take one server down at a time, again without interrupting students).

rather than single-user performance. I would be very nervous about switching to asynchronous processing. The current code is simple, and therefore quite reliable, which are good properties to have. Getting asynchronous code right is hard. (Perhaps I am just a wimp.)

Anyway, I coded up this commit: https://github.com/trampgeek/moodle-qtype_coderunner/pull/24. (Thanks for the tip about where to set the job id.) However, we have not yet tested it with our load-balancer, so you may want to hold off merging it. Or you could merge it now, and we can fix anything that needs fixing later.