Richard Lobb suggested that I post here, so here goes…
First, a short background:
- I work as a sysadmin at the Department of Computer Science at Lund University in Sweden
- We installed CodeRunner in 2017 to assist in our Java courses (they are to be switched to Python starting this fall)
- In the spring of 2020, we started running exams in CodeRunner, and we managed to run all the major exams during the pandemic in Moodle/CodeRunner using LockDown Browser and Respondus Monitor (the students sat at home and used their own computers). Together, these gave us confidence in the validity of our exams
- As early as June 2020, we had some 350 students taking the Java exam at the same time
- Now we have almost 500 students simultaneously in the big exams, and next year we are planning for over 700 (in Python)
- Our system consists of one virtual Moodle server (4 CPU, 16 GB RAM) and one JOBE server (16 CPU, 8 GB RAM), both running Ubuntu 20.04 LTS
- In between the Moodle server and the JOBE server, we have a "vjobe", an nginx proxy in which we can configure another JOBE server (a physical machine with 24 cores and loads of RAM) to assist when we have a big exam
- The Moodle server is a Docker-based system, and the Moodle container runs Apache with mpm-prefork (more on that later)
- We have developed a "monitor system" (heading for GitHub) that gives us great freedom in monitoring how the systems behave. That system would be a *long* post in itself, but suffice it to say that we have 45 "probes" (monitoring points of interest) on our Moodle server and 21 on the JOBE server, and we can plot them over time, assign rules that trigger email notifications, and so on. Among the measurements are "Webserver: request time (upper 99% confidence)", "Webserver: free workers", "Webserver: #visitors (last minute)" and "CodeRunner: #compiles (last minute)"
- This monitoring system has been vital for tuning the system! Without it, it would have been like driving a car without an instrumentation panel and with the windows almost entirely blocked by ice
- Now we can follow, in almost real time (it updates once per minute), how the system behaves
- We have stress tested the system using locust (https://locust.io)
- For input, we used some 1,000 assignments from actual exams
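The "vjobe" layer mentioned above can be sketched as a minimal nginx reverse proxy with an upstream pool. The names and addresses below are hypothetical, and a real config would also carry TLS, timeouts and so on:

```nginx
# Minimal sketch of a "vjobe" balancer (hypothetical names/addresses).
# Uncomment the second server before a big exam to bring in the
# physical 24-core machine.
upstream jobe_pool {
    server jobe1.example.org:80;
    # server jobe2.example.org:80;
}

server {
    listen 80;
    server_name vjobe.example.org;

    location / {
        proxy_pass http://jobe_pool;
        proxy_set_header Host $host;
    }
}
```

The nice part of this arrangement is that the Moodle side only ever knows one JOBE URL (the vjobe name), so capacity can be added or removed without touching the CodeRunner plugin settings.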
Since we were to have such large exams, we increased the number of CPUs on JOBE to 16, and we almost crashed the June 2020 exam since we didn’t know how to properly increase the number of users as well. But we figured it out (during the exam – I do not recommend that experience!) and now we have 128 users dealing with the workload, and so far it has worked like a charm.
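For anyone facing the same scaling step: in the Jobe versions we are familiar with, the size of the user pool is a setting in `application/config/config.php`, and the installer has to be re-run so that the extra jobeNN users are actually created. The path below assumes a standard Jobe checkout; verify the parameter name and location against your own Jobe version:

```shell
# In the Jobe source tree (standard layout; verify for your Jobe version):
# edit application/config/config.php and raise the pool size, e.g.
#     $config['jobe_max_users'] = 128;
# then re-run the installer so users jobe00..jobe127 are created:
cd /var/www/html/jobe        # hypothetical install location
sudo ./install
```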
A colleague of mine also changed the JOBE server so that each Java process is pinned to a single core, to avoid the performance penalty incurred when threads switch cores. A pull request is in the pipe.
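The core idea, in its simplest form, is what `taskset` from util-linux does; the snippet below is a naive sketch of pinning (the actual Jobe change is more involved, and `Prog` is a placeholder class name):

```shell
# Pinning a process to one core with taskset (util-linux).
# This is only the core idea; the real Jobe change is more involved.
CORE=0      # in Jobe, each jobeNN user could get its own core
taskset -c "$CORE" echo "running pinned to core $CORE"
# for a Java submission it would be along the lines of:
#   taskset -c "$CORE" java Prog    # "Prog" is a placeholder
```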
During the recent exam, we changed the prefork module so that it has the following values:
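The concrete numbers are not reproduced here; purely as an illustration, an mpm_prefork stanza sized for the 300 workers mentioned further down might look like this (hypothetical values, not our actual ones):

```apache
# /etc/apache2/mods-available/mpm_prefork.conf (Debian/Ubuntu layout)
# Illustrative values only. Size MaxRequestWorkers to the available RAM:
# roughly MaxRequestWorkers * per-child RSS + everything else <= total RAM.
<IfModule mpm_prefork_module>
    StartServers             50
    MinSpareServers          25
    MaxSpareServers          75
    ServerLimit             300
    MaxRequestWorkers       300
    MaxConnectionsPerChild 1000
</IfModule>
```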
We also set TimeOut in default-ssl.conf to 60 (it was 600).
We turned off the swap (though it should not really matter) and doubled both the CPU and RAM of the machine in order to deal with the 455 students who actually turned up.
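Turning swap off is a one-liner, and it only lasts until the next reboot (comment the swap entry out of /etc/fstab to make it permanent):

```shell
sudo swapoff -a    # disable all swap until the next reboot
# and after the exam, if wanted:
sudo swapon -a
```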
During the exam, I monitored the system using nmon (https://nmon.sourceforge.net/pmwiki.php) as well as our own system, and everything went as smooth as silk.
We have previously noticed that the number of Apache workers has dropped to 0 during the initial login rush (when the students log in at the start of the exam). In some cases, that has led to a server outage for a few minutes, despite everything else being nominal; it always recovered within 5 minutes or so. We increased the number of workers to 300 and asked the teachers to stagger the students' logins, and it has worked very well.
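For anyone without a monitor system of their own, the same worker numbers can be read from Apache's mod_status, assuming it is enabled (`a2enmod status`) and reachable on localhost; the `?auto` variant gives machine-readable output:

```shell
# Requires mod_status and the default localhost-only /server-status location.
curl -s "http://localhost/server-status?auto" \
  | grep -E 'BusyWorkers|IdleWorkers' \
  || echo "server-status not reachable"
```

Polling this once a minute and alerting when IdleWorkers approaches 0 catches exactly the login-rush starvation described above.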