python-pytorch is not reproducible
(address . bug-guix@gnu.org)
Bad news!
$ guix challenge python-pytorch
/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0 contents differ:
no local build for '/gnu/store/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0'
https://ci.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0: 0i55iwy3z4da4lhn93dnrmz775s9ga5kyfli6cmrchacacf9xfpq
https://bordeaux.guix.gnu.org/nar/lzip/dgdswx4vvf07xmhih21n4fnr68dh3fhd-python-pytorch-1.9.0: 1fl2v4pd0gcw7wp5k662q0zd4lvvzsggcm5ii8b4kq4v6synhkic
differing file:
/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so
1 store items were analyzed:
- 0 (0.0%) were identical
- 1 (100.0%) differed
- 0 (0.0%) were inconclusive
$ guix describe
Generation 189 Aug 30 2021 12:09:27 (current)
guix f91ae94
repository URL: https://git.savannah.gnu.org/git/guix.git
branch: master
commit: f91ae9425bb385b60396a544afe27933896b8fa3
The file is 165 MiB and Diffoscope (which reads the output of ‘objdump’)
takes forever on it.
However, by comparing the output of ‘strings’ on each file, we get a
hint:
diff -ubBr --show-c-function /tmp/str2 /tmp/str1
--- /tmp/str2 2021-09-19 11:14:47.806798779 +0200
+++ /tmp/str1 2021-09-19 11:14:41.962761127 +0200
@@ -1100584,472 +1100584,472 @@ compute_fast_convolution_input_gradient
compute_grad_kernel_transform
compute_fast_convolution_kernel_gradient.isra.0
compute_fast_convolution_output
-nnp_fft8x8_with_offset_and_stream__avx2.__local0
-nnp_fft8x8_with_offset_and_stream__avx2.__local13
-nnp_fft8x8_with_offset_and_stream__avx2.__local18
-nnp_fft8x8_with_offset_and_stream__avx2.__local1
+nnp_fft8x8_with_offset_and_stream__avx2.__local5
nnp_fft8x8_with_offset_and_stream__avx2.__local16
+nnp_fft8x8_with_offset_and_stream__avx2.__local6
+nnp_fft8x8_with_offset_and_stream__avx2.__local11
+nnp_fft8x8_with_offset_and_stream__avx2.__local0
nnp_fft8x8_with_offset_and_stream__avx2.__local2
nnp_fft8x8_with_offset_and_stream__avx2.__local7
-nnp_fft8x8_with_offset_and_stream__avx2.__local17
-nnp_fft8x8_with_offset_and_stream__avx2.__local10
-nnp_fft8x8_with_offset_and_stream__avx2.__local8
nnp_fft8x8_with_offset_and_stream__avx2.__local15
+nnp_fft8x8_with_offset_and_stream__avx2.__local8
nnp_fft8x8_with_offset_and_stream__avx2.__local3
-nnp_fft8x8_with_offset_and_stream__avx2.__local6
-nnp_fft8x8_with_offset_and_stream__avx2.__local14
-nnp_fft8x8_with_offset_and_stream__avx2.__local9
+nnp_fft8x8_with_offset_and_stream__avx2.__local1
nnp_fft8x8_with_offset_and_stream__avx2.__local4
[…]
nnp_shdotxf8__avx2.__local13
-nnp_shdotxf8__avx2.__local15
nnp_shdotxf8__avx2.__local0
+nnp_shdotxf8__avx2.__local9
+nnp_shdotxf8__avx2.__local10
+nnp_shdotxf8__avx2.__local11
+nnp_shdotxf8__avx2.__local12
+nnp_shdotxf8__avx2.__local2
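Since the excerpt above is truncated, it cannot tell us whether the two binaries contain the same names in a different order or genuinely different names. A minimal sketch of how to check that, assuming both copies of the library fit in memory; `printable_strings` only roughly imitates ‘strings’, and all function names here are hypothetical:

```python
import re
from collections import Counter


def printable_strings(data, min_len=4):
    """Return runs of printable ASCII, roughly like strings(1)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def compare_strings(data_a, data_b):
    """Tell reordering apart from real content differences."""
    strings_a = printable_strings(data_a)
    strings_b = printable_strings(data_b)
    count_a, count_b = Counter(strings_a), Counter(strings_b)
    return {
        # True only if both files yield the same strings in the same order.
        "same_sequence": strings_a == strings_b,
        # True if the strings are the same once order is ignored.
        "same_multiset": count_a == count_b,
        "only_in_a": sorted((count_a - count_b).elements()),
        "only_in_b": sorted((count_b - count_a).elements()),
    }
```

If `same_multiset` is true while `same_sequence` is false, the two builds differ only in the order of the names, not in the names themselves.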
The differing symbols appear to come from NNPACK, one of the libraries
that are still bundled. These functions seem to be generated by Python
scripts that use PeachPy, such as NNPACK/src/x86_64-fma/2d-fourier-8x8.py:
for post_operation in ["stream", "store"]:
    fft8x8_arguments = (arg_t_pointer, arg_f_pointer, arg_t_stride, arg_f_stride, arg_row_count, arg_column_count, arg_row_offset, arg_column_offset)
    with Function("nnp_fft8x8_with_offset_and_{post_operation}__avx2".format(post_operation=post_operation),
        fft8x8_arguments, target=uarch.default + isa.fma3 + isa.avx2):
        […]
The ‘__local’ bit in the name comes from PeachPy, in peachpy/name.py:
suffixed_name = "__local" + str(suffix)
for name_object in iter(unnamed_objects):
    # Generate a non-conflicting name by appending a suffix
    while suffixed_name in self.names:
        suffix += 1
        suffixed_name = "__local" + str(suffix)
So the problem may be that these functions get generated in parallel,
which would make the numbering non-deterministic.
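To illustrate why the order matters, here is a simplified model of that kind of suffixing loop. It is not PeachPy's actual code: `assign_local_names` and the object names are made up for the example, and it makes no claim about what changes the order in the real build.

```python
def assign_local_names(unnamed_objects, taken=()):
    """Give each unnamed object the next free "__localN" name.

    Simplified model only; PeachPy's real logic lives in peachpy/name.py.
    """
    names = set(taken)
    assigned = {}
    suffix = 0
    for obj in unnamed_objects:
        suffixed_name = "__local" + str(suffix)
        # Skip suffixes that are already in use.
        while suffixed_name in names:
            suffix += 1
            suffixed_name = "__local" + str(suffix)
        names.add(suffixed_name)
        assigned[obj] = suffixed_name
    return assigned


constants = ["const_a", "const_b", "const_c"]  # hypothetical objects
first = assign_local_names(constants)
second = assign_local_names(reversed(constants))
print(first)
print(second)
```

The same three objects end up with different ‘__local’ numbers as soon as they are visited in a different order, which is the kind of difference the ‘strings’ diff shows.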
NNPACK/CMakeLists.txt has this bit to generate targets to build all
that:
ADD_CUSTOM_COMMAND(
  OUTPUT ${obj}
  COMMAND "PYTHONPATH=${PEACHPY_PYTHONPATH}"
    ${PYTHON_EXECUTABLE} -m peachpy.x86_64
      -mabi=sysv -g4 -mimage-format=${PEACHPY_IMAGE_FORMAT}
      "-I${PROJECT_SOURCE_DIR}/src" "-I${PROJECT_SOURCE_DIR}/src/x86_64-fma" "-I${FP16_SOURCE_DIR}/include"
      -o ${obj} "${PROJECT_SOURCE_DIR}/${src}"
  DEPENDS ${NNPACK_BACKEND_PEACHPY_OBJS})
It might be that building just those targets sequentially would solve
the problem.
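One way to test that hypothesis would be to keep the object files from two builds and compare them one by one. A rough sketch, assuming the two build trees have been saved side by side and that the objects match the glob ‘*.o’ (both are assumptions, as are the function names):

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def differing_objects(dir_a, dir_b, pattern="*.o"):
    """List files under dir_a that are missing from or differ in dir_b."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    differing = []
    for obj in sorted(dir_a.rglob(pattern)):
        rel = obj.relative_to(dir_a)
        other = dir_b / rel
        if not other.is_file() or file_digest(obj) != file_digest(other):
            differing.append(str(rel))
    return differing
```

An empty list for two sequential builds, but not for two parallel ones, would support the idea that parallelism is the culprit.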
To be continued…
Ludo’.