Parallel submissions failed test

by Michael Lim
Number of replies: 3

Hi there,


We are trying to set up our server to allow more parallel submissions in preparation for this semester's finals, since we expect many students to submit close to the deadline. So far, I have changed the Jobe config parameters in application/config/config.php as follows:


$config['jobe_max_users'] = 50;
$config['cputime_upper_limit_secs'] = 30;

Afterwards, I reinstalled the Jobe server with sudo ./install and tested with the provided testsubmit.py, increasing NUM_PARALLEL_SUBMISSIONS to 50.

However, when I run the test and it reaches the "Checking parallel submissions" phase, some jobs fail to compile with the following error:


{
  'run_id': None,
  'outcome': 11,
  'cmpinfo': "/var/www/m2/jobe/application/libraries/../../runguard/runguard: illegal user specified: 974\nTry `/var/www/m2/jobe/application/libraries/../../runguard/runguard --help' for more information.\n",
  'stdout': '',
  'stderr': ''
}

Is this possibly related to the changes in version 1.4.2, which include the fix "Bug fix: Jobe server overload was being incorrectly reported as a Runguard error ("No user jobe-1")"?

Many thanks, Michael
In reply to Michael Lim

Re: Parallel submissions failed test

by Richard Lobb

The error is simply because you've configured Jobe with way more users than I was expecting! Line 14 in the file runguard/runguard-config.h defines a list of users that runguard will accept as submitters of jobs. It only goes up to jobe19. You can add users jobe20 - jobe49 if you really want to try with 50 users.
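If it helps, here is a small Python sketch for generating the extra names rather than typing them by hand. The helper name extra_jobe_users is made up for illustration, and the comma-separated "jobeNN" layout is an assumption, so check it against the existing entries in runguard/runguard-config.h before pasting anything in.

```python
# Hypothetical helper: builds the extra user names to append to the
# user list in runguard/runguard-config.h. The comma-separated layout
# and the "jobeNN" naming are assumptions, not taken from the Jobe source.

def extra_jobe_users(first: int, last: int) -> str:
    """Return the names jobe<first>..jobe<last> (inclusive), comma-separated."""
    if first > last:
        raise ValueError("first must not exceed last")
    return ",".join(f"jobe{n:02d}" for n in range(first, last + 1))


# Names needed to go from the existing 20 users to 50.
addition = extra_jobe_users(20, 49)
print(addition)
```

Since the list lives in a C header, I would expect runguard to need rebuilding after the edit.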

I'll be interested to hear what results you get. My expectation is that throughput will actually drop if the number of users is much more than the number of CPUs on your server. And at some point performance may well collapse completely as the server starts to thrash. However, it will depend on the language you're using and the mix of jobs. So please report back what you find.

In reply to Richard Lobb

Re: Parallel submissions failed test

by Michael Lim
Hi Richard! Thanks so much, it works well now.

We are running the Jobe server together with the Moodle server secured under a firewall, so it is definitely a concern. I dialled back the number of runners to 30, which should be enough considering we are only using CodeRunner for one Python3 course of 300 students. I don't expect the code to be complicated either since it's an entry level course, so runtime should not be a concern.

Will definitely let you know the results after the finals.

Michael
In reply to Michael Lim

Re: Parallel submissions failed test

by Richard Lobb

Good to hear it's going now.

I should perhaps clarify the meaning of those constants you changed, as they probably don't do quite what you think. JOBE_MAX_USERS sets the number of Jobe tasks that can be running simultaneously, but that's not actually the point at which the system overloads. If additional jobs come in when no Jobe users are available, they go into a polling loop, sleeping for one second between attempts while waiting for a free Jobe user. Only after MAX_RETRIES failed attempts (currently 8) does the job return a RESULT_SERVER_OVERLOAD failure condition. So Jobe is probably more robust in the face of sudden peak demands than you were thinking. We run Python tests and exams with up to 500 students on an 8-CPU Jobe server and the Jobe load factor rarely gets over 2, at which point the server is only 25% busy. I would still be comfortable with a load factor that hit 10 - 12, provided it was short-lived. So I doubt you need to change the Jobe configuration constants from their default values, and in fact I fear you might actually be degrading performance unless you have more than 8 CPUs on your server.
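To make the retry behaviour described above concrete, here is a rough Python sketch of that kind of polling loop. It is not Jobe's actual code (Jobe is written in PHP). The function acquire_user, the try_get_free_user callback and the injectable sleep parameter are all assumptions made for illustration; only MAX_RETRIES = 8 and the one-second sleep come from the description above.

```python
import time
from typing import Callable, Optional

MAX_RETRIES = 8  # figure quoted in the post above; treat as illustrative


def acquire_user(
    try_get_free_user: Callable[[], Optional[int]],
    max_retries: int = MAX_RETRIES,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Poll for a free user; return its index, or None if none frees up."""
    for _ in range(max_retries):
        user = try_get_free_user()
        if user is not None:
            return user
        sleep(1)  # wait one second before polling again
    return None  # the caller would report RESULT_SERVER_OVERLOAD here


# Demo with a fake pool that only has a free user on the third poll.
polls = {"count": 0}


def fake_pool() -> Optional[int]:
    polls["count"] += 1
    return 5 if polls["count"] >= 3 else None


# The sleep is stubbed out so the demo finishes instantly.
print(acquire_user(fake_pool, sleep=lambda seconds: None))  # prints 5
```

The point of the sketch is that a burst of submissions larger than the user pool is queued for several seconds rather than rejected outright, so the pool size is not the overload threshold.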