Pandas with CodeRunner/Jobe

by Nicolas Dunand -
Number of replies: 16
Hello,

I recently reinstalled a whole jobe server from scratch and I have the same kind of issue as Miki in https://coderunner.org.nz/mod/forum/discuss.php?d=670, in this case with pandas.

First I got the same error as Miki, where pip complains that "This environment is externally managed", but then installing via "apt-get install python3-pandas" makes pandas available in a Python shell (calling python3, then trying "import pandas", for instance), but not in CodeRunner, where I get the following error:

***Run error***
Traceback (most recent call last):
  File "__tester__.python3", line 1, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

Any ideas? Sorry, I couldn't find an answer anywhere about this.
In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Mike McDowell -

Sounds like pandas wasn't installed globally on the jobe server.

sudo -H pip install PACKAGE_NAME

In reply to Mike McDowell

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -
Thanks Mike,

However, when I do that I get the same error as Miki above: "This environment is externally managed [...] To install Python packages system-wide, try apt install python3-xyz, where xyz is the package you are trying to install."
In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Richard Lobb -

Have you tried

sudo apt-get install python3-pandas

?

That seems to work OK with Python 3.11.

In reply to Richard Lobb

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -
Thanks Richard,

Yes, I have tried this, with no success. When I open a shell on the Jobe server I can run python3 and then import pandas without error, but sandbox runs launched from a question answer in Moodle (qtype_coderunner) yield the ModuleNotFoundError.

I'll keep investigating and report back.
In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Richard Lobb -

This suggests that you either have two versions of Python installed or have installed pandas in a virtual environment.

What's the output from:

(a) When running Python from a shell on Jobe:

import sys, pandas
print(sys.executable)
print(sys.path)
print(pandas.__file__)

(b) from a Python run via CodeRunner:

import sys
print(sys.executable)
print(sys.path)
In reply to Richard Lobb

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -
Thanks Richard:

(a) results in:

>>> import sys, pandas
>>> print(sys.executable)
/usr/bin/python3
>>> print(sys.path)
['', '/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '/usr/local/lib/python3.11/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.11/dist-packages']
>>> print(pandas.__file__)
/usr/lib/python3/dist-packages/pandas/__init__.py


(b) results in:

/usr/bin/python3
['/home/jobe/runs/jobe_tgGuzC', '/usr/lib/python311.zip', '/usr/lib/python3.11', '/usr/lib/python3.11/lib-dynload', '/usr/local/lib/python3.11/dist-packages', '/usr/lib/python3/dist-packages', '/usr/lib/python3.11/dist-packages']


No virtualenv as far as I can see.

If I try the import pandas in a CodeRunner run, I now get the error "***Time limit exceeded***".
In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Richard Lobb -
Thanks. I don't see anything suspicious in there, but it seems like the ModuleNotFound problem which we were trying to resolve is now fixed. But now you have a different problem: a time limit.

I just tried writing a simple python3 question type that asked for a function nums that returns a simple pandas series of fixed integers. And it worked fine even without raising the memory limit (which I'd have expected would be necessary).

Is the time limit exceeded message occurring even with a one-line program "import pandas as pd"?

I don't have an immediate explanation. But I would certainly try customising your question, and setting the sandbox parameters in the Advanced customisation section as follows:

[screenshot: sandbox parameters in the question's Advanced customisation section, with MemLimit set to 0]

The memlimit of 0 isn't safe for production use as it turns off memory limit checks altogether, but it's useful for debugging.
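
If you want to confirm what limit a run is actually getting, something along these lines (just a quick check of my own, not part of the settings in the screenshot) could be submitted as the answer:

import resource

# Report the address-space (virtual memory) limit the sandbox imposes on this run.
# Values are in bytes; resource.RLIM_INFINITY (-1) means no limit.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("RLIMIT_AS soft:", soft, "hard:", hard)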

Let me know if that helps.



In reply to Richard Lobb

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -
Thanks Richard.

Indeed the ModuleNotFound error is fixed somehow and this is a different problem.

The time limit error appears even when simply giving 'import pandas as pd' as the answer. The error is displayed after about 12 seconds.

I tried the advanced customisation parameters you suggested, and after playing with them a bit, MemLimit appears to be the significant one; the other parameters have no noticeable influence. Setting MemLimit to zero solves the problem, but any other value (I tried up to 1024) still results in the "Time limit exceeded" error.
In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Richard Lobb -
There is something very odd about your Jobe server configuration.

With some help from my AI mate Claude, here's a Python script for you to run via CodeRunner [**EDIT** Code changed to fix a bug in the reporting of resource limits and to avoid the use of the psutil module]. Please report its output:

#!/usr/bin/env python3
import os
import platform
import subprocess
import sys

def get_memory_info():
    """Get memory information from /proc/meminfo"""
    try:
        with open('/proc/meminfo') as f:
            meminfo = {}
            for line in f:
                key, value = line.split(':')
                # Convert kb to bytes and remove ' kB' suffix
                value = int(value.strip().split()[0]) * 1024
                meminfo[key] = value
            
            total_gb = round(meminfo['MemTotal'] / (1024**3), 2)
            available_gb = round((meminfo.get('MemAvailable', meminfo['MemFree'])) / (1024**3), 2)
            used_percent = round(((meminfo['MemTotal'] - meminfo.get('MemAvailable', meminfo['MemFree'])) / 
                                meminfo['MemTotal']) * 100, 2)
            
            return total_gb, available_gb, used_percent
    except:
        return None, None, None

def get_cpu_info():
    """Get CPU information from /proc/cpuinfo"""
    try:
        with open('/proc/cpuinfo') as f:
            cpuinfo = f.read()
            
        physical_ids = set()
        cpu_cores = set()
        
        for line in cpuinfo.split('\n'):
            if line.startswith('physical id'):
                physical_ids.add(line.split(':')[1].strip())
            elif line.startswith('processor'):
                cpu_cores.add(line.split(':')[1].strip())
                
        return len(physical_ids) or 1, len(cpu_cores)
    except:
        return None, None

def get_disk_usage(path='/'):
    """Get disk usage information"""
    try:
        st = os.statvfs(path)
        total = st.f_blocks * st.f_frsize
        free = st.f_bfree * st.f_frsize
        return round(total / (1024**3), 2), round(free / (1024**3), 2)
    except:
        return None, None

def get_system_info():
    info = {}
    
    # Basic system info
    info['python_version'] = sys.version
    info['platform'] = platform.platform()
    info['architecture'] = platform.machine()
    info['processor'] = platform.processor()
    
    # Memory information
    total_mem, available_mem, mem_percent = get_memory_info()
    if total_mem is not None:
        info['total_memory_gb'] = total_mem
        info['available_memory_gb'] = available_mem
        info['memory_percent_used'] = mem_percent
    
    # CPU information
    physical_cpus, logical_cpus = get_cpu_info()
    if physical_cpus is not None:
        info['cpu_count_physical'] = physical_cpus
        info['cpu_count_logical'] = logical_cpus
    
    # Linux-specific information
    if platform.system() == 'Linux':
        try:
            # Get Linux distribution details
            with open('/etc/os-release') as f:
                for line in f:
                    if line.startswith('PRETTY_NAME='):
                        info['linux_distribution'] = line.split('=')[1].strip().strip('"')
                        break
            
            # Check SELinux status
            try:
                selinux_status = subprocess.check_output(['getenforce'], text=True).strip()
                info['selinux_status'] = selinux_status
            except (subprocess.CalledProcessError, FileNotFoundError):
                info['selinux_status'] = 'Not installed or not accessible'
            
            # Get available storage
            root_total, root_free = get_disk_usage('/')
            if root_total is not None:
                info['root_total_gb'] = root_total
                info['root_free_gb'] = root_free
            
            # Check for common limiting factors
            # Max processes
            try:
                with open('/proc/sys/kernel/pid_max') as f:
                    info['max_processes'] = f.read().strip()
            except:
                info['max_processes'] = 'Unable to determine'
            
            # Resource limits
            try:
                ulimit_output = subprocess.check_output(['bash', '-c', 'ulimit -a'], 
                                                      shell=False, text=True)
                info['resource_limits'] = ulimit_output.strip()
            except:
                info['resource_limits'] = 'Unable to determine'
                
            # Check for containerization
            try:
                with open('/proc/1/cgroup') as f:
                    cgroup_content = f.read()
                    info['containerized'] = ('docker' in cgroup_content.lower() or 
                                           'lxc' in cgroup_content.lower())
            except:
                info['containerized'] = 'Unable to determine'
                
        except Exception as e:
            info['linux_specific_error'] = str(e)
            
    return info

def format_output(info):
    output = "=== System Diagnostics ===\n\n"
    
    # Format the output nicely
    for key, value in info.items():
        if key == 'resource_limits':
            output += f"\n=== Resource Limits ===\n{value}\n"
        else:
            output += f"{key.replace('_', ' ').title()}: {value}\n"
    
    return output

if __name__ == "__main__":
    try:
        info = get_system_info()
        print(format_output(info))
    except Exception as e:
        print(f"Error gathering system information: {e}")
In reply to Richard Lobb

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -
Thanks Richard, here's the result:

(The system is a simple Debian 12 (bookworm) with Jobe installed as per https://github.com/trampgeek/jobe?tab=readme-ov-file#installation. We have also installed a few extra packages, since the system wouldn't let us pip-install Python extensions, such as:

python3-matplotlib python3-numpy python3-geopandas python3-pandas python3-cartopy python3-pyshp)


=== System Diagnostics ===

    Python Version: 3.11.2 (main, Sep 14 2024, 03:00:30) [GCC 12.2.0]
    Platform: Linux-6.1.0-28-amd64-x86_64-with-glibc2.36
    Architecture: x86_64
    Processor:
    Total Memory Gb: 15.62
    Available Memory Gb: 14.81
    Memory Percent Used: 5.1
    Cpu Count Physical: 8
    Cpu Count Logical: 8
    Linux Distribution: Debian GNU/Linux 12 (bookworm)
    Selinux Status: Not installed or not accessible
    Root Total Gb: 11.09
    Root Free Gb: 4.93
    Max Processes: 4194304

    === Resource Limits ===
    40000
    Containerized: False
In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Richard Lobb -

Thanks Nicolas.

That mostly looks perfectly normal, except that the call to platform.processor() seems to return the empty string. On the machines I've tried, it yields 'x86_64'. But your architecture is still 'x86_64', so I doubt this difference is significant. The Resource Limits output isn't helpful because of a bug in the code (the ulimit_output assignment should be ulimit_output = subprocess.check_output(['bash', '-c', 'ulimit -a'], shell=False, text=True)), but I doubt it's worth fixing that.

I scrolled back through your previous postings and I see you said you'd tried MemLimit values up to 1024. But actually, that's just 1GB which is more-or-less the default limit on a Python run. So you could try a value of say 2000 or even 4000 for the MemLimit value to see if that solves your problem. Note that this value sets the ulimit memory for the task, which is a virtual memory limit and many languages like to pre-allocate lots of virtual memory, most of which they never use. So you often need much higher values than you'd expect.
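
If you're curious to see the effect of that sort of virtual-memory cap outside Jobe, here's a minimal sketch (my own illustration using Python's resource module; the 500 MB figure is just an example, not a recommended setting) that applies the same kind of address-space limit before importing pandas:

import resource

# Cap this process's address space (virtual memory) at roughly 500 MB,
# which is the kind of limit Jobe's MemLimit imposes on a job.
limit_bytes = 500 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

import pandas as pd   # with too low a cap this fails with a MemoryError (or is killed)
print("pandas imported OK")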

However, I just did some checks with our Jobe server and, with the out-of-the-box Python3 question type, the line import pandas as pd works OK with a memory limit of 500 MB. With 400 MB I get a MemoryError crash, and at 300 MB a timeout error. Somehow, your Jobe server seems to be requiring at least two or three times as much (virtual) memory as ours for the same job, and I have no idea why, sorry.

Richard

In reply to Richard Lobb

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -
Thanks Richard,

Indeed, maybe I don't understand well enough how Python works with memory, and 1GB seemed like a high limit to me. In fact, when I push the limit up further, at about 1.5GB the time limit error message disappears and the code executes properly.

I'm not sure how I should tackle this in a production environment: should I specify a higher limit per question for those that use, for instance, pandas or other memory-hungry modules?

In any case, thanks for all your help
In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Richard Lobb -
I don't really understand Python's memory manager, either. But it's important to realise that the memory limit you're setting is not a limit on physical memory use but on virtual memory only. Python allocates space for heap and stack, and numpy uses OpenBLAS which allocates space for each core (on the assumption it will be using all CPUs).

If you're curious to see how much actual physical memory is in use, you could run the following via CodeRunner:

def get_memory_usage_linux():
    with open("/proc/self/status") as f:
        status = f.readlines()

    memory_info = {}
    for line in status:
        if "VmRSS" in line:  # Resident Set Size (physical memory)
            memory_info["rss"] = int(line.split()[1])  # In kilobytes
        if "VmSize" in line:  # Virtual Memory Size
            memory_info["vms"] = int(line.split()[1])  # In kilobytes

    print(f"Physical memory (RSS): {memory_info['rss'] / 1024:.2f} MB")
    print(f"Virtual memory (VMS): {memory_info['vms'] / 1024:.2f} MB")

get_memory_usage_linux()
import pandas as pd
print()
get_memory_usage_linux()

On our Jobe servers, the output is:

Physical memory (RSS): 9.88 MB
Virtual memory (VMS): 16.24 MB

Physical memory (RSS): 69.41 MB
Virtual memory (VMS): 415.36 MB

At any rate, for production use I think you'd be quite safe using 2GB, or even 3GB. The limit is only there to protect against rogue runaway programs gobbling all the memory and bringing the Jobe server to its knees, at least until the job times out. Normal programs are unlikely to hit the limit. And even if they do, they'll get thrown out well before physical memory is exhausted - you've got 16GB, I see.

In reply to Richard Lobb

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -
Thanks, here is the memory output:

Physical memory (RSS): 8.83 MB
Virtual memory (VMS): 16.69 MB

Physical memory (RSS): 65.57 MB
Virtual memory (VMS): 1144.73 MB

This is a fresh install of Jobe (Version: 2.0.3, 2 November 2024) on Debian Linux, with some extra Python packages added, such as python3-pandas, etc.

What is weird is that when I run this on a similarly configured Docker 'jobeinabox', the output is much closer to what you have:

Physical memory (RSS): 8.91 MB
Virtual memory (VMS): 12.62 MB

Physical memory (RSS): 60.99 MB
Virtual memory (VMS): 370.71 MB

In any case, is it possible then to raise the default memory limit to 2GB or so?

Another route I could try to investigate would be a fresh install of Jobe on a new server.

My use case: I'm still aiming at classes of about 200 students hitting CodeRunner in more or less the same time frame via Moodle, so I want to make sure I won't be too limited.


In reply to Nicolas Dunand

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Richard Lobb -

[You'll see I've split this discussion off to a new thread, since it's mostly pandas-specific].

I don't wish to raise the default memory limit to 2GB, since most users don't use pandas, and even among pandas users, most don't have your problem. I suggest you set the limit you want in whatever prototype you're using. Better still, create a new prototype python3_pandas for your course, with that limit in it.

It would be interesting to know why your memory demands are so much higher. I suspect numpy rather than pandas. You could test with the following addition, prior to importing numpy and/or pandas:

import os
os.environ["OMP_NUM_THREADS"] = "4"

That limits the number of threads for which numpy will allocate memory. The behaviour you're observing is consistent with running on a machine with considerably more than the 8 cores that the config-checker program I gave you is reporting.
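
Put together, a complete test would look something like the following (just a sketch; the thread count of 4 is arbitrary, and the environment variable must be set before numpy is first imported):

import os
os.environ["OMP_NUM_THREADS"] = "4"   # must precede the first import of numpy

import numpy as np

# Report the virtual-memory footprint after the import, as in the earlier script.
with open("/proc/self/status") as f:
    for line in f:
        if line.startswith("VmSize"):
            print("Virtual memory (VMS): {:.2f} MB".format(int(line.split()[1]) / 1024))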

In reply to Richard Lobb

Re: How TO FIX :" import numpy as np, ModuleNotFoundError: No module named 'numpy'

by Nicolas Dunand -

Thanks for splitting, I agree.

Thanks for the suggestion, I'll raise the memory limit in a dedicated prototype for numpy/pandas questions.

I confirm it is numpy, and not pandas itself, that is gobbling up all the memory. I used an adapted version of your script, and the results are:

Physical memory (RSS): 8.94 MB
Virtual memory (VMS): 16.69 MB

after import numpy
Physical memory (RSS): 33.32 MB
Virtual memory (VMS): 1036.91 MB

after import pandas
Physical memory (RSS): 65.62 MB
Virtual memory (VMS): 1144.77 MB

I indeed have 8 cores (vCPUs on a VM). Then trying with

import os
os.environ["OMP_NUM_THREADS"] = "4"

before calling get_memory_usage_linux() yields:

Physical memory (RSS): 8.85 MB
Virtual memory (VMS): 16.69 MB

after import numpy
Physical memory (RSS): 31.12 MB
Virtual memory (VMS): 493.02 MB

after import pandas
Physical memory (RSS): 65.72 MB
Virtual memory (VMS): 568.85 MB

The server is Debian 12.8 (6.1.0-28-amd64), whereas the jobeinabox Docker container (which shows the more "normal" behaviour) uses Ubuntu ...