From: Ludovic Courtès
To: bug-guix@gnu.org
Subject: SLURM client version must match daemon version
X-Debbugs-CC: Ricardo Wurmus
Date: Mon, 02 Nov 2020 10:10:55 +0100
Message-ID: <87imaonmxs.fsf@inria.fr>

Hello,

We’ve noticed the problem below on clusters running a foreign distro
when slurmd is version 19.x and our clients are version 20.x:

--8<---------------cut here---------------start------------->8---
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- package -A slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
python-slurm-magic  0.0-0.73dd1a2  out  gnu/packages/parallel.scm:225:4
slurm               20.02.5        out  gnu/packages/parallel.scm:109:2
slurm-drmaa         1.1.1          out  gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- environment --ad-hoc slurm -- squeue
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
slurm_load_jobs error: Zero Bytes were transmitted or received
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- package -A slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
python-slurm-magic  0.0-0.73dd1a2  out  gnu/packages/parallel.scm:225:4
slurm               19.05.3-2      out  gnu/packages/parallel.scm:109:2
slurm-drmaa         1.1.1          out  gnu/packages/parallel.scm:194:2
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- environment --ad-hoc slurm -- squeue
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[courtes@devel01 ~]$ /usr/bin/squeue --version
slurm 19.05.2
--8<---------------cut here---------------end--------------->8---

This means that we cannot generally use the Guix-provided SLURM on
clusters running foreign distros.

The SLURM documentation reads:

  Slurm daemons will support RPCs and state files from the two previous
  major releases (e.g. a version 17.11.x SlurmDBD will support slurmctld
  daemons and commands with a version of 17.11.x, 17.02.x or 16.05.x).

Looking at the release history, there have been quite a few releases
between 19.05.3-2 and 20.02.5, which may explain the problem I
described.

Apparently the only .so in Open MPI linked against SLURM is
‘lib/openmpi/mca_pmix_s1.so’. The ABI diff below suggests that the two
versions are not ABI-compatible, so one wouldn’t be able to use
‘--with-graft’ to graft one version in lieu of the other:

--8<---------------cut here---------------start------------->8---
[courtes@devel01 ~]$ guix time-machine --commit=09b00a62b297edb92ac4dde6f4838261ac0cad16 -- build slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
/gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2
[courtes@devel01 ~]$ guix time-machine --commit=2f107f273de3db1d01bdec66b13334edef7ad036 -- build slurm
Mise à jour du canal « guix » depuis le dépôt Git « https://git.savannah.gnu.org/git/guix.git »...
/gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5
[courtes@devel01 ~]$ abidiff --stat /gnu/store/37b7qnwck4pg51qia4w002i62g156xgw-slurm-19.05.3-2/lib/slurm/libslurmfull.so /gnu/store/7n6aks2wcmn2pxv03q8ij38hsj9zfzk9-slurm-20.02.5/lib/slurm/libslurmfull.so
Functions changes summary: 0 Removed, 0 Changed, 0 Added function
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
Function symbols changes summary: 80 Removed, 162 Added function symbols not referenced by debug info
Variable symbols changes summary: 3 Removed, 0 Added variable symbols not referenced by debug info
--8<---------------cut here---------------end--------------->8---

What can we do about it?

At least, we should package several known-useful versions, so that
people can use ‘--with-input=slurm@X=slurm@Y’ (if needed) or explicitly
refer to the version they want in their profile. I’ll work on that.

Anything else? I heard that PMIx, a scheduler-independent API, will
eventually supersede SLURM in Open MPI. Let’s see if that loosens
version requirements.

Thanks,
Ludo’.
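
PS: To make the compatibility window quoted above concrete, here is a
minimal Python sketch of one way to read that rule. It is only an
illustration, not code from SLURM or Guix: the names ‘MAJOR_RELEASES’,
‘major’ and ‘daemon_supports’ are made up for this message, and the
list of major releases after 17.11 is an assumption that should be
checked against upstream’s release history.

```python
# Major releases in chronological order.  The first three come from the
# documentation quoted above; the later ones are an assumption to verify
# against upstream's release history.
MAJOR_RELEASES = ["16.05", "17.02", "17.11", "18.08", "19.05", "20.02"]


def major(version):
    """Return the major release of a version string.

    For example '20.02.5' gives '20.02' and '19.05.3-2' gives '19.05'.
    """
    return ".".join(version.split(".")[:2])


def daemon_supports(daemon_version, client_version, window=2):
    """Return True if the daemon is expected to accept the client's RPCs.

    Under this reading of the rule, the client must be at the daemon's
    major release or at most `window` major releases older.  A client
    newer than the daemon is not covered.
    """
    daemon = MAJOR_RELEASES.index(major(daemon_version))
    client = MAJOR_RELEASES.index(major(client_version))
    return 0 <= daemon - client <= window
```

With the versions from the transcripts above,
‘daemon_supports("19.05.2", "20.02.5")’ is false because the client is
newer than the daemon, which is consistent with the squeue failure,
whereas ‘daemon_supports("19.05.2", "19.05.3-2")’ is true.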