SLURM client version must match daemon version

Open
Submitted by Ludovic Courtès.
Details
3 participants
  • Ludovic Courtès
  • Ludovic Courtès
  • Ricardo Wurmus
Owner
unassigned
Severity
normal
Ludovic Courtès wrote on 2 Nov 2020 10:10
(address . bug-guix@gnu.org)
87imaonmxs.fsf@inria.fr
Hello,
We’ve noticed the problem below on clusters running a foreign distro when slurmd is version 19.x and our clients are version 20.x:

[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- package -A slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
python-slurm-magic 0.0-0.73dd1a2 out gnu/packages/parallel.scm:225:4
slurm 20.02.5 out gnu/packages/parallel.scm:109:2
slurm-drmaa 1.1.1 out gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- environment --ad-hoc slurm -- squeue
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
slurm_load_jobs error: Zero Bytes were transmitted or received
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- package -A slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
python-slurm-magic 0.0-0.73dd1a2 out gnu/packages/parallel.scm:225:4
slurm 19.05.3-2 out gnu/packages/parallel.scm:109:2
slurm-drmaa 1.1.1 out gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- environment --ad-hoc slurm -- squeue
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[courtes@devel01 ~]$ /usr/bin/squeue --version
slurm 19.05.2

It means that we cannot generally use the Guix-provided SLURM on clusters running foreign distros.
https://slurm.schedmd.com/troubleshoot.html#network reads:
Slurm daemons will support RPCs and state files from the two previous major releases (e.g. a version 17.11.x SlurmDBD will support slurmctld daemons and commands with a version of 17.11.x, 17.02.x or 16.05.x).
Looking at https://download.schedmd.com/slurm/, there’s been quite a few releases between 19.05.3-2 and 20.02.5, which may explain the problem I described.
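The compatibility rule quoted from the SLURM documentation can be sketched as a small check. This is only an illustration under stated assumptions: the list of major releases is hand-written (the thread itself mentions 16.05, 17.02, 17.11, 19.05 and 20.02), the function names are hypothetical, and the rule is read as covering only clients that are the same age as the daemon or older. Under that reading, a 20.02 client talking to a 19.05 daemon falls outside the supported window, which is consistent with the failure shown above.

```shell
#!/usr/bin/env bash
# Sketch of the "two previous major releases" rule from the SLURM docs.
# ASSUMPTION: hand-maintained list of major releases, oldest first.
SLURM_MAJORS=(16.05 17.02 17.11 18.08 19.05 20.02)

# Print the position of major release $1 in SLURM_MAJORS; fail if unknown.
major_index() {
  local i
  for i in "${!SLURM_MAJORS[@]}"; do
    if [ "${SLURM_MAJORS[$i]}" = "$1" ]; then
      echo "$i"
      return 0
    fi
  done
  return 1
}

# Reduce a full version such as "19.05.3-2" to its major release "19.05".
slurm_major() {
  echo "$1" | cut -d. -f1,2
}

# Succeed if a daemon at version $1 is documented to accept a client at
# version $2: same major release, or one of the two previous ones.
# Returns 2 when either version is not in the list.
client_supported() {
  local d c
  d=$(major_index "$(slurm_major "$1")") || return 2
  c=$(major_index "$(slurm_major "$2")") || return 2
  [ "$c" -le "$d" ] && [ $((d - c)) -le 2 ]
}

# The situation from the report: 19.05.2 daemon, 20.02.5 client.
if client_supported 19.05.2 20.02.5; then
  echo "20.02.5 client with 19.05.2 daemon: within the documented window"
else
  echo "20.02.5 client with 19.05.2 daemon: outside the documented window"
fi
```

The check only encodes the documented window; it says nothing about whether a given pair happens to work in practice.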

Apparently the only .so in Open MPI linked against SLURM is ‘lib/openmpi/mca_pmix_s1.so’. The diff suggests that the two versions are not ABI-compatible, so one wouldn’t be able to use ‘--with-graft’ to graft one version in lieu of the other:

[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- build slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
/gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- build slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
/gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5
[courtes@devel01 ~]$ abidiff --stat /gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2/lib/slurm/libslurmfull.so /gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5/lib/slurm/libslurmfull.so
Functions changes summary: 0 Removed, 0 Changed, 0 Added function
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
Function symbols changes summary: 80 Removed, 162 Added function symbols not referenced by debug info
Variable symbols changes summary: 3 Removed, 0 Added variable symbols not referenced by debug info

What can we do about it?
At least, we should package several known-useful versions, so that people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly refer to the version they want in their profile. I’ll work on that.
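Once several versions are packaged, picking the one that matches the cluster could look roughly like the sketch below. It assumes the host's client prints its version as ‘slurm 19.05.2’, as /usr/bin/squeue does in the session above, and that Guix provides a package whose version starts with the same major release. The helper name slurm_spec_for is hypothetical, and the guix commands are left as comments because they only make sense on a machine with Guix and a SLURM cluster.

```shell
#!/usr/bin/env bash
# Turn `squeue --version` output such as "slurm 19.05.2" into a Guix
# package specification such as "slurm@19.05".
slurm_spec_for() {
  local version=${1#slurm }          # drop the leading "slurm "
  echo "slurm@$(echo "$version" | cut -d. -f1,2)"
}

# Hypothetical usage on a cluster (ASSUMPTION: a matching version is packaged):
#   spec=$(slurm_spec_for "$(/usr/bin/squeue --version)")
#   guix environment --ad-hoc "$spec" -- squeue
# or, to rebuild a dependent package against that version:
#   guix build openmpi --with-input=slurm="$spec"

slurm_spec_for "slurm 19.05.2"       # prints: slurm@19.05
```

Matching on the major release rather than the exact version follows the compatibility statement in the SLURM documentation, which is phrased in terms of major releases.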
Anything else?
I heard that PMIx, a scheduler-independent API, will eventually supersede SLURM in Open MPI. Let’s see if that loosens version requirements.
Thanks,
Ludo’.
Ludovic Courtès wrote on 2 Nov 2020 15:36
(address . 44387@debbugs.gnu.org)(name . Ricardo Wurmus)(address . rekado@elephly.net)
877dr3lta5.fsf@gnu.org
Ludovic Courtès <ludovic.courtes@inria.fr> skribis:
> At least, we should package several known-useful versions, so that
> people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly
> refer to the version they want in their profile. I’ll work on that.

I’ve reintroduced version 19.05:
https://git.savannah.gnu.org/cgit/guix.git/commit/?id=e1bd62eb5ce0f2410b2607f157989588791b43e0
Ricardo Wurmus wrote on 2 Nov 2020 17:27
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 44387@debbugs.gnu.org)
87361rzpth.fsf@elephly.net
Ludovic Courtès <ludo@gnu.org> writes:
> Ludovic Courtès <ludovic.courtes@inria.fr> skribis:
>
>> At least, we should package several known-useful versions, so that
>> people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly
>> refer to the version they want in their profile. I’ll work on that.
>
> I’ve reintroduced version 19.05:
>
> https://git.savannah.gnu.org/cgit/guix.git/commit/?id=e1bd62eb5ce0f2410b2607f157989588791b43e0

Good call. It seems like a good idea to keep older major versions around.
There’s a similar problem with postgres, which needs (or used to need) more than one version to upgrade existing data from an older version.
-- Ricardo

To comment on this conversation send email to 44387@debbugs.gnu.org