From: Ludovic Courtès
To: Maxim Cournoyer
Cc: Ricardo Wurmus, 51536@debbugs.gnu.org
Subject: Re: bug#51536: openblas builds not reproducible on different x86_64 machines
Date: Thu, 03 Feb 2022 00:13:33 +0100
Message-ID: <87czk4rheq.fsf@gnu.org>
In-Reply-To: <87h7cw7ewb.fsf@gmail.com> (Maxim Cournoyer's message of "Sun, 31 Oct 2021 23:07:00 -0400")

Hi!

Maxim Cournoyer writes:

> Our OpenBLAS package uses DYNAMIC_ARCH=1 to provide optimizations for
> all supported targets, at least of x86 and x86_64.  In theory that seems
> OK, but in practice the builds differ depending on the host CPU.

What follows is the log of an investigation that didn’t find the root
cause, but perhaps it’ll give us ideas…

Right now the build results of ci.guix and bordeaux.guix differ:

--8<---------------cut here---------------start------------->8---
$ guix describe
Generacio 202	Jan 30 2022 23:57:03	(nuna)
  guix 43dd34c
    repository URL: https://git.savannah.gnu.org/git/guix.git
    branch: master
    commit: 43dd34c7777a212c99a97da7a2c237158faa9a1b
ludo@ribbon ~/src/guix$ guix challenge openblas
/gnu/store/ras6dprsw3wm3swk23jjp8ww5dwxj333-openblas-0.3.18 contents differ:
  no local build for '/gnu/store/ras6dprsw3wm3swk23jjp8ww5dwxj333-openblas-0.3.18'
  https://ci.guix.gnu.org/nar/lzip/ras6dprsw3wm3swk23jjp8ww5dwxj333-openblas-0.3.18: 0m1jlc26yrwxn8gxwpj8452kw4g84ywclh0hnab93873ifz87s5c
  https://bordeaux.guix.gnu.org/nar/lzip/ras6dprsw3wm3swk23jjp8ww5dwxj333-openblas-0.3.18: 1d0m9v3kpsqzplpl1law2lfhm6rrbhkkqsvh19dlg9wx45vbbvjb
  differing file:
    /lib/libopenblasp-r0.3.18.so

1 store items were analyzed:
  - 0 (0.0%) were identical
  - 1 (100.0%) differed
  - 0 (0.0%) were inconclusive
--8<---------------cut here---------------end--------------->8---

To get an idea, I thought we could compare the two build logs:

  https://ci.guix.gnu.org/log/ras6dprsw3wm3swk23jjp8ww5dwxj333-openblas-0.3.18
  https://bordeaux.guix.gnu.org/build/3fab433c-e7d3-498d-86f8-4bcd5da9c4db

(Protip: I found the second one via .)

The “ar -ru ../libopenblasp-r0.3.18.a …” invocations are apparently the
same in both cases, which rules out the simple case of unsorted .o files.

The .so on ci.guix is slightly bigger:

--8<---------------cut here---------------start------------->8---
$ wget -qO - https://ci.guix.gnu.org/nar/lzip/ras6dprsw3wm3swk23jjp8ww5dwxj333-openblas-0.3.18| lzip -d | guix archive -x /tmp/o1
$ wget -qO - https://bordeaux.guix.gnu.org/nar/lzip/ras6dprsw3wm3swk23jjp8ww5dwxj333-openblas-0.3.18| lzip -d | guix archive -x /tmp/o2
$ ls -l /tmp/{o1,o2}/lib/libopenblasp-r0.3.18.so
-r-xr-xr-x 1 ludo users 40538768 Jan  1  1970 /tmp/o1/lib/libopenblasp-r0.3.18.so
-r-xr-xr-x 1 ludo users 40436368 Jan  1  1970 /tmp/o2/lib/libopenblasp-r0.3.18.so
--8<---------------cut here---------------end--------------->8---

Both have the same symbols though, and in the same order:

--8<---------------cut here---------------start------------->8---
$ diff -u <(objdump -T /tmp/o1/lib/libopenblasp-r0.3.18.so |cut -c 60- ) <(objdump -T /tmp/o2/lib/libopenblasp-r0.3.18.so |cut -c60- )
$ echo $?
0
--8<---------------cut here---------------end--------------->8---

… which suggests they include code optimized for the same
micro-architectures because symbols include the name of the
micro-architecture:

--8<---------------cut here---------------start------------->8---
$ objdump -T /tmp/o1/lib/libopenblasp-r0.3.18.so |cut -c 60-|tail -10
csymm3m_RU
cgemv_c_BARCELONA
csymv_U_HASWELL
dtrmm_iltncopy_CORE2
LAPACKE_dsytrs2
openblas_num_threads_env
csycon_rook_
csytri_rook_
--8<---------------cut here---------------end--------------->8---

Some of the offsets differ though:

$ diff -u <(objdump -T /tmp/o1/lib/libopenblasp-r0.3.18.so ) <(objdump -T /tmp/o2/lib/libopenblasp-r0.3.18.so )
--- /dev/fd/63 2022-02-03 00:10:17.308357982 +0100
+++ /dev/fd/62 2022-02-03 00:10:17.276357923 +0100
@@ -1,5 +1,5 @@
-/tmp/o1/lib/libopenblasp-r0.3.18.so: format de fixer elf64-x86-64
+/tmp/o2/lib/libopenblasp-r0.3.18.so: format de fixer elf64-x86-64
 DYNAMIC SYMBOL TABLE:
 0000000000000000 DF *UND* 0000000000000000 GLIBC_2.3.2 pthread_cond_signal
@@ -91,57 +91,57 @@
 00000000013edb70 g DF .text 00000000000001be Base zgemm3m_incopyb_BULLDOZER
 0000000000e6d200 g DF .text 0000000000002b06 Base strsm_kernel_RT_BOBCAT
 0000000000512c00 g DF .text 0000000000000a0a Base zsymv_U_PRESCOTT
-00000000023c7530 g DF .text 0000000000000201 Base LAPACKE_dpttrs_work
+00000000023ae930 g DF .text 0000000000000201 Base LAPACKE_dpttrs_work
 0000000000692000 g DF .text 0000000000000b89 Base srot_k_PENRYN
 000000000179caa0 g DF .text 0000000000000200 Base dgemm_beta_HASWELL
 0000000000a44690 g DF .text 00000000000004b4 Base dtrsm_iutucopy_OPTERON
-000000000231cfc0 g DF .text 000000000000021d Base LAPACKE_sstein_work
-0000000002327800 g DF .text 000000000000014b Base LAPACKE_ssytrd
-0000000001ad9100 g DF .text 00000000000002aa Base chemm_outcopy_SKYLAKEX
+00000000023043c0 g DF .text 000000000000021d Base LAPACKE_sstein_work
+000000000230ec00 g DF .text 000000000000014b Base LAPACKE_ssytrd
+0000000001acc900 g DF .text 00000000000002aa Base chemm_outcopy_SKYLAKEX
 00000000017d6c10 g DF .text 0000000000000c38 Base cgemv_n_HASWELL
-0000000002327b70 g DF .text 0000000000000143 Base LAPACKE_ssytrf
+000000000230ef70 g DF .text 0000000000000143 Base LAPACKE_ssytrf
 000000000018f010 g DF .text 000000000000025c Base cblas_stbmv
 0000000000195a20 g DF .text 000000000000003b Base cblas_idamin
-0000000002328d40 g DF .text 0000000000000101 Base LAPACKE_ssytri
+0000000002310140 g DF .text 0000000000000101 Base LAPACKE_ssytri
 000000000077be00 g DF .text 0000000000000e65 Base ztrsm_kernel_RN_PENRYN
 0000000001583f20 g DF .text 0000000000001c22 Base dtrmm_iltucopy_STEAMROLLER
-00000000021bf830 g DF .text 0000000000000527 Base ztbcon_
-0000000001a70630 g DF .text 00000000000001c7 Base dsymm_oltcopy_SKYLAKEX
-000000000245a910 g DF .text 000000000000001b Base LAPACKE_zpp_nancheck
+00000000021a6c30 g DF .text 0000000000000527 Base ztbcon_
+0000000001a640c0 g DF .text 000000000000066d Base dsymm_oltcopy_SKYLAKEX
+0000000002441d10 g DF .text 000000000000001b Base LAPACKE_zpp_nancheck
 000000000108ee20 g DF .text 000000000000014d Base zgemm3m_oncopyb_ATOM
-0000000002409df0 g DF .text 000000000000035c Base LAPACKE_zgtsvx_work
-0000000001e7d120 g DF .text 0000000000001743 Base dlatrs_
-0000000001e948a0 g DF .text 00000000000001d1 Base drscl_
+00000000023f11f0 g DF .text 000000000000035c Base LAPACKE_zgtsvx_work
+0000000001e64520 g DF .text 0000000000001743 Base dlatrs_
+0000000001e7bca0 g DF .text 00000000000001d1 Base drscl_
 00000000019ac700 g DF .text 00000000000004bd Base zhemm3m_iucopyb_ZEN
 00000000003c0f30 g DF .text 000000000000001e Base support_avx512_bf16
-0000000002329ac0 g DF .text 0000000000000107 Base LAPACKE_ssytrs
+0000000002310ec0 g DF .text 0000000000000107 Base LAPACKE_ssytrs
 0000000000f94890 g DF .text 00000000000002d3 Base ztrmm_oltncopy_BOBCAT

On #guix-hpc Ricardo mentioned encountering this reproducibility issue
earlier.

Ludo’.
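
In the diff excerpt above, most changed lines differ only in their address, but dsymm_oltcopy_SKYLAKEX also differs in size (0x1c7 versus 0x66d). Symbols whose size changes are likely closer to the cause than symbols that merely shifted. The following is a minimal sketch, not part of the original investigation, that lists such symbols from two saved `objdump -T` dumps. The function name `symsize_diff` is hypothetical, and the field layout (size, version, name as the last three fields of each symbol line) is an assumption based on the output quoted above.

```shell
#!/usr/bin/env bash
# symsize_diff FILE1 FILE2
#
# FILE1 and FILE2 contain the text output of `objdump -T` for the two
# builds.  For every symbol present in both files whose size differs,
# print "NAME SIZE1 SIZE2".
#
# Possible invocation, reusing the paths from the message above:
#   symsize_diff <(objdump -T /tmp/o1/lib/libopenblasp-r0.3.18.so) \
#                <(objdump -T /tmp/o2/lib/libopenblasp-r0.3.18.so)
symsize_diff() {
    awk '
        # Skip anything that is not a symbol line: symbol lines start
        # with a hexadecimal address.
        $1 !~ /^[0-9a-f]+$/ || NF < 5 { next }

        # Assumption: the last field is the symbol name and the size is
        # the third field from the end (size, version, name).
        # First file: remember the size of each symbol.
        FNR == NR { size[$NF] = $(NF - 2); next }

        # Second file: report symbols whose size changed.
        ($NF in size) && size[$NF] != $(NF - 2) {
            print $NF, size[$NF], $(NF - 2)
        }
    ' "$1" "$2"
}
```

Symbols that appear several times under different versions would overwrite each other in this sketch; that did not seem worth handling for a first pass.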