SLURM client version must match daemon version

Status: Open
Participants: Ludovic Courtès, Ricardo Wurmus
Owner: unassigned
Submitted by: Ludovic Courtès
Severity: normal

Ludovic Courtès wrote on 2 Nov 2020 10:10
(address . bug-guix@gnu.org)
87imaonmxs.fsf@inria.fr
Hello,

We’ve noticed the problem below on clusters running a foreign distro
when slurmd is version 19.x and our clients are version 20.x:

[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- package -A slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
python-slurm-magic 0.0-0.73dd1a2 out gnu/packages/parallel.scm:225:4
slurm 20.02.5 out gnu/packages/parallel.scm:109:2
slurm-drmaa 1.1.1 out gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- environment --ad-hoc slurm -- squeue
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
slurm_load_jobs error: Zero Bytes were transmitted or received
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- package -A slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
python-slurm-magic 0.0-0.73dd1a2 out gnu/packages/parallel.scm:225:4
slurm 19.05.3-2 out gnu/packages/parallel.scm:109:2
slurm-drmaa 1.1.1 out gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- environment --ad-hoc slurm -- squeue
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[courtes@devel01 ~]$ /usr/bin/squeue --version
slurm 19.05.2

It means that we cannot generally use the Guix-provided SLURM on
clusters running foreign distros.


According to the SLURM documentation:

Slurm daemons will support RPCs and state files from the two previous
major releases (e.g. a version 17.11.x SlurmDBD will support slurmctld
daemons and commands with a version of 17.11.x, 17.02.x or 16.05.x).

Looking at https://download.schedmd.com/slurm/, there have been quite a
few releases between 19.05.3-2 and 20.02.5, which may explain the
problem I described.
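
The compatibility statement quoted above is one-directional: daemons accept clients from older releases, not newer ones, which is consistent with a 20.02 client failing against a 19.05 daemon. A minimal shell sketch of that check follows. The function names `slurm_major` and `slurm_client_not_newer` are hypothetical, and the check covers only the upper bound: it does not verify the "two previous major releases" lower bound, which would require the list of SLURM releases.

```shell
# Extract the major release ("YY.MM") from a version such as "19.05.3-2".
slurm_major() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed when the client's major release is not newer than the daemon's.
# Simplification (assumption): the lower bound on client age is not checked.
slurm_client_not_newer() {
    # Compare releases as integers, e.g. "19.05" becomes 1905.
    c=$(slurm_major "$1" | tr -d .)
    d=$(slurm_major "$2" | tr -d .)
    [ "$c" -le "$d" ]
}

# Versions taken from the transcript above: Guix client vs. host daemon.
if slurm_client_not_newer 20.02.5 19.05.2; then
    echo "client release is not newer than the daemon"
else
    echo "client release is newer than the daemon"
fi
```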


Apparently the only .so in Open MPI linked against SLURM is
‘lib/openmpi/mca_pmix_s1.so’. The diff suggests that the two versions are
not ABI-compatible, so one wouldn’t be able to use ‘--with-graft’ to
graft one version in lieu of the other:

[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- build slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
/gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- build slurm
Updating channel 'guix' from Git repository at 'https://git.savannah.gnu.org/git/guix.git'...
/gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5
[courtes@devel01 ~]$ abidiff --stat /gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2/lib/slurm/libslurmfull.so /gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5/lib/slurm/libslurmfull.so
Functions changes summary: 0 Removed, 0 Changed, 0 Added function
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
Function symbols changes summary: 80 Removed, 162 Added function symbols not referenced by debug info
Variable symbols changes summary: 3 Removed, 0 Added variable symbols not referenced by debug info
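
To make the reading of that summary explicit, here is a small sketch that totals the "Removed" counts from `abidiff --stat` output. The helper name `abi_removed_total` is hypothetical, and treating any removed symbol as a sign that grafting is unsafe is a rough heuristic, not an official criterion.

```shell
# Sum the "N Removed" counts across all "changes summary" lines on stdin.
abi_removed_total() {
    awk '/changes summary:/ {
             # The count is the field just before the word "Removed".
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^Removed,?$/) total += $(i - 1)
         }
         END { print total + 0 }'
}

# Feed it the summary lines shown in the transcript above.
abi_removed_total <<'EOF'
Functions changes summary: 0 Removed, 0 Changed, 0 Added function
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
Function symbols changes summary: 80 Removed, 162 Added function symbols not referenced by debug info
Variable symbols changes summary: 3 Removed, 0 Added variable symbols not referenced by debug info
EOF
```

A non-zero total means that symbols present in the older library are gone from the newer one, so binaries linked against the former may fail to resolve them.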

What can we do about it?

At least, we should package several known-useful versions, so that
people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly
refer to the version they want in their profile. I’ll work on that.
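
As an illustration of picking the version explicitly, the sketch below derives a Guix package specification from the host's `squeue --version` output. The helper name `guix_slurm_spec` is hypothetical, and it assumes that Guix actually packages the matching major release, which may not hold for every cluster.

```shell
# Turn the output of "squeue --version" (e.g. "slurm 19.05.2") into a
# Guix package specification for the same major release ("slurm@19.05").
guix_slurm_spec() {
    awk '$1 == "slurm" {
             split($2, v, ".")
             print "slurm@" v[1] "." v[2]
             exit
         }'
}

# Example with the host version reported in the transcript above.
echo "slurm 19.05.2" | guix_slurm_spec

# On a cluster, assuming the host client lives in /usr/bin:
#   spec=$(/usr/bin/squeue --version | guix_slurm_spec)
#   guix environment --ad-hoc "$spec" -- squeue
#   guix build openmpi --with-input=slurm="$spec"
```

Matching only the major release follows from the transcript, where a 19.05.3-2 client talks to a 19.05.2 daemon without error.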

Anything else?

I heard that PMIx, a scheduler-independent API, will eventually
supersede SLURM in Open MPI. Let’s see if that loosens version
requirements.

Thanks,
Ludo’.
Ludovic Courtès wrote on 2 Nov 2020 15:36
(address . 44387@debbugs.gnu.org)(name . Ricardo Wurmus)(address . rekado@elephly.net)
877dr3lta5.fsf@gnu.org
Ludovic Courtès <ludovic.courtes@inria.fr> writes:

> At least, we should package several known-useful versions, so that
> people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly
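> refer to the version they want in their profile. I’ll work on that.

I’ve reintroduced version 19.05:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=e1bd62eb5ce0f2410b2607f157989588791b43e0
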
Ricardo Wurmus wrote on 2 Nov 2020 17:27
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 44387@debbugs.gnu.org)
87361rzpth.fsf@elephly.net
Ludovic Courtès <ludo@gnu.org> writes:

> Ludovic Courtès <ludovic.courtes@inria.fr> writes:
>
>> At least, we should package several known-useful versions, so that
>> people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly
>> refer to the version they want in their profile. I’ll work on that.
>
> I’ve reintroduced version 19.05:
>
> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=e1bd62eb5ce0f2410b2607f157989588791b43e0

Good call. It seems like a good idea to keep older major versions
around.

There’s a similar problem with postgres, which needs (or used to need)
more than one version to upgrade existing data from an older version.

--
Ricardo