Add hook for TensorFlow v2.18.1 #177

Open
TopRichard wants to merge 8 commits into EESSI:main from TopRichard:TensorFlow-v2-18-1

Conversation

@TopRichard
Collaborator

@TopRichard TopRichard commented Mar 10, 2026

Replacing h5py with a version that provides wheels for both x86 and ARM resolves the initial GLIBC error. This could be applied as a patch, but in this PR it is implemented in a parse hook as a proof of concept.

Results for x86_64 builds:

Executed 844 out of 844 tests: 844 tests pass.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

KleidiAI in TF 2.18 has a -march/-mcpu conflict similar to the one in XNNPACK. The easyblock already excludes XNNPACK from -mcpu=native on aarch64, and this PR extends the same exclusion to KleidiAI. This change could also be added to the easyblock.

Results for aarch64 builds:

Executed 844 out of 844 tests: 844 tests pass.
There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.

see #171

@trz42
Contributor

trz42 commented Mar 10, 2026

Looks good to me. Is the same fix needed for the CUDA version? Anyhow, you have to trigger the build for one architecture. I guess it only applies to a single EESSI version? Ah, yes, #171 notes this. It might be good to mention this in the title or description too.

@TopRichard
Collaborator Author

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2

@eessi-bot-aws

eessi-bot-aws bot commented Mar 10, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_177/138186

date job status comment
Mar 10 12:56:14 UTC 2026 submitted job id 138186 awaits release by job manager
Mar 10 12:57:14 UTC 2026 released job awaits launch by Slurm scheduler
Mar 10 12:58:16 UTC 2026 running job 138186 is running
Mar 10 12:59:17 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-138186.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen2-17731474630.tar.zst
size: 0 MiB (27296 bytes)
entries: 1
modules under 2025.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen2
2025.06/init/easybuild/eb_hooks.py
Mar 10 12:59:17 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen2+default
P: latency: 1.52 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen2+default
P: latency: 2.04 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen2+default
P: latency: 0.18 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen2+default
P: bandwidth: 7997.57 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138186.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
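The pass/fail checks listed in the bot comment above can be approximated with a few regular-expression searches over the job output. The following is only a minimal sketch: the pattern strings are taken from the check list, while the function names `check_job_output` and `job_succeeded` and the returned dictionary are hypothetical and are not the bot's actual implementation.

```python
import re

# Patterns copied from the bot's check list. Everything else here
# (names, return shape) is an assumption made for illustration.
FORBIDDEN = [r"FATAL:", r"ERROR:", r"FAILED:", r"required modules missing:"]
REQUIRED = [r"No missing installations", r"\.tar\..* created!"]


def check_job_output(text):
    """Return a dict mapping each check description to True (passed) or False."""
    results = {}
    for pattern in FORBIDDEN:
        # A forbidden pattern passes only when it is absent.
        results["no message matching " + pattern] = re.search(pattern, text) is None
    for pattern in REQUIRED:
        # A required pattern passes only when it is present.
        results["found message matching " + pattern] = re.search(pattern, text) is not None
    return results


def job_succeeded(text):
    """A job counts as successful only if every individual check passed."""
    return all(check_job_output(text).values())
```

Calling `job_succeeded` on the contents of a job output file such as `slurm-138186.out` would then give a single boolean, while `check_job_output` keeps the per-check detail.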

@bedroge
Contributor

bedroge commented Mar 10, 2026

Note that @casparvl is currently deploying #170, which also changes the hooks file, so this probably needs a sync+rebuild after that one is ingested/merged.

@TopRichard
Collaborator Author

> Note that @casparvl is currently deploying #170, which also changes the hooks file, so this probably needs a sync+rebuild after that one is ingested/merged.

We should pause modifications to the hooks file as soon as we start building TensorFlow. Those builds could take some time, and we do not want to redo builds that have already succeeded.

Contributor

@boegel boegel left a comment

I could be wrong, but it seems like at least some part of this could/should be done in the TensorFlow easyblock or the easyconfig file being used, especially since at first sight these changes don't seem specific to EESSI at all?

ec['preconfigopts'] = ec.get('preconfigopts', '') + (
    'export GCC_HOST_COMPILER_PATH=$EBROOTGCC/bin/gcc && '
    'sed -i \'s|--define=PREFIX=/usr|--define=PREFIX=\\$EESSI_EPREFIX|g\' .bazelrc && '
    'cat > /tmp/fix_h5py.py << \'EOF\'\n'

I don't like the use of a hardcoded path in a fixed location...

Can we at least use /tmp/$USER/, or even better, mktemp -d?

Some comments in this hook explaining exactly what all of this does, and why it's required, would be helpful, I think, since it looks quite involved.
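One way the `mktemp -d` suggestion could look is sketched below. This is only an illustration under assumptions: the helper name `script_via_tmpdir` is hypothetical, a `python` command is assumed to be available in the build environment, and the snippet is assumed to be prepended to the configure command the same way the hook extends `preconfigopts`.

```python
def script_via_tmpdir(script_name, script_body):
    """Build a shell snippet that writes script_body into a fresh directory
    created by `mktemp -d`, runs it with python, and removes the directory.

    The snippet ends with '&& ' so that a following command (for example the
    configure command) only runs if the script and the cleanup succeeded.
    """
    # The path is resolved by the shell at run time, not by Python.
    path = '"$tmpdir/{}"'.format(script_name)
    return (
        'tmpdir=$(mktemp -d) && '
        "cat > {path} << 'EOF'\n"   # quoted EOF: the shell does not expand the body
        '{body}\n'
        'EOF\n'
        'python {path} && rm -rf "$tmpdir" && '
    ).format(path=path, body=script_body.rstrip('\n'))


# Usage with a plain dict standing in for the easyconfig (an assumption).
ec = {'preconfigopts': ''}
ec['preconfigopts'] = ec.get('preconfigopts', '') + script_via_tmpdir(
    'fix_h5py.py', 'print("patched")'
)
```

Because the directory name is generated at run time, concurrent builds by different users on the same node would not write to the same file.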

'with open("requirements_lock_3_12.txt", "r") as f:\n'
' content = f.read()\n'
'content = content.replace("h5py==3.11.0 \\\\", "h5py==3.15.1 \\\\")\n'
'content = content.replace(\n'

Can we use re.sub here instead, especially below, to avoid the long repetition?

Something like this (totally untested):

regex = re.compile("(--hash=sha256:f4e025e852754ca833401777c25888acb96889ee2c27e7e629a19aee288833f0)", re.M)
extra_hashes = "..."
content = regex.sub(extra_hashes, content)
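A slightly fuller version of the `re.sub` idea could look like the sketch below. It is an illustration only: the function name `patch_h5py_pin` is hypothetical, the layout of the lock file (one `--hash=` entry per backslash-continued line) is an assumption, and the hashes passed in by the caller are placeholders, since the real values are not shown in this conversation.

```python
import re

# Hash quoted in the review comment above; it marks the entry to replace.
ANCHOR_HASH = "--hash=sha256:f4e025e852754ca833401777c25888acb96889ee2c27e7e629a19aee288833f0"


def patch_h5py_pin(content, new_hashes):
    """Bump the h5py pin and swap the anchor hash for the given hashes.

    new_hashes is a list of strings such as "sha256:<hex digest>".
    """
    # Bump the pinned version where it starts a requirement line.
    content = re.sub(r"^h5py==3\.11\.0\b", "h5py==3.15.1", content, flags=re.M)

    # Build the replacement block: one --hash entry per continuation line.
    block = " \\\n    ".join("--hash=" + h for h in new_hashes)

    # Passing a function as the replacement avoids backslash-escape
    # processing, which matters because the block contains backslashes.
    return re.sub(re.escape(ANCHOR_HASH), lambda match: block, content)
```

Whatever follows the anchor on its line, either a trailing backslash or the end of the line, is left untouched, so the result stays a valid continuation block in both cases.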

if get_eessi_envvar('EESSI_CPU_FAMILY') == 'aarch64':
    ec['prebuildopts'] = ec.get('prebuildopts', '') + (
        # KleidiAI in TF 2.18 has similar -march/-mcpu conflict as XNNPACK
        # The easyblock already excludes XNNPACK from -mcpu=native, extend the same exclusion to KleidiAI

Isn't this something that should be fixed in the TensorFlow easyblock, or in the easyconfig file for TensorFlow 2.18.1?

Now that I think of it, same question w.r.t. the hash stuff above...
Wouldn't that be a lot cleaner as a patch file? It seems like that could work, since it's done pre-configure.


4 participants