Avoid assigning meaning to arch_target_map keys #294

Open
casparvl opened this issue Jan 31, 2025 · 1 comment

@casparvl

Currently, the arch_target_map keys have meaning: they are interpreted by the bot/build.sh from software-layer to represent the OS/SUBDIR (but not accelerator). This is problematic, because if I have a system with e.g. zen4 CPU nodes and zen4+H100 GPU nodes, those would normally both be encoded as linux/x86_64/amd/zen4 in the architecture map - and clearly I cannot do that because keys have to be unique.

It would be better if the keys were meaningless and bot/build.sh got its information elsewhere. For example:

arch_target_map = {
    'virtual_partition_1': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'slurm_params': '-p genoa <etc>',
    },
    'virtual_partition_2': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'accel': 'nvidia/cc90',
        'slurm_params': '-p gpu_h100 <etc>',
    },
}

This would then configure the CPU-only zen4 partition and the zen4+H100 partition, respectively.

This would require changes in two places:

  1. The bot code, because the bot currently assumes that each value in arch_target_map is the Slurm parameter string to be used for submission. That should change: the bot should look one level deeper, i.e. use arch_target_map['some_partition']['slurm_params'] instead of arch_target_map['some_os_subdir'].
  2. The bot/build.sh should extract the relevant information from a more deeply nested dict.
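The first change could look roughly like the sketch below. This is not the bot's actual code: `get_slurm_params` is a hypothetical helper, and the `<etc>` placeholders from the example map are dropped for brevity. It reads the Slurm parameters one level deeper and fails loudly when it meets an old-style entry whose value is still a plain string.

```python
# Nested map in the proposed format; the partition names carry no meaning.
arch_target_map = {
    'virtual_partition_1': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'slurm_params': '-p genoa',
    },
    'virtual_partition_2': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'accel': 'nvidia/cc90',
        'slurm_params': '-p gpu_h100',
    },
}


def get_slurm_params(arch_map, partition):
    """Hypothetical helper: return the Slurm parameters for one partition."""
    entry = arch_map[partition]
    if isinstance(entry, str):
        # Old flat format: the value itself was the Slurm parameter string.
        raise TypeError(
            f"entry '{partition}' uses the old flat format; "
            "expected a dict with a 'slurm_params' key"
        )
    return entry['slurm_params']


print(get_slurm_params(arch_target_map, 'virtual_partition_2'))
```

Raising on the old format is only one option; a transition period could instead accept both formats and log a warning.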
@casparvl

casparvl commented Apr 9, 2025

Changes will be needed at least around the following line:

for arch, slurm_opt in arch_map.items():
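As a rough illustration of how that loop might change, the sketch below iterates over the nested map and unpacks each entry instead of treating the key as the architecture and the value as the Slurm options. The names `BuildTarget` and `iter_build_targets` are hypothetical and do not exist in the bot; the example map is a shortened copy of the proposal with the `<etc>` placeholders left out.

```python
from typing import NamedTuple, Optional


class BuildTarget(NamedTuple):
    """Hypothetical record describing one entry of the nested map."""
    partition: str
    os_name: str
    subdir: str
    accel: Optional[str]
    slurm_params: str


def iter_build_targets(arch_map):
    """Yield one BuildTarget per entry; the key is only an opaque label."""
    for partition, entry in arch_map.items():
        yield BuildTarget(
            partition=partition,
            os_name=entry['os'],
            subdir=entry['subdir'],
            accel=entry.get('accel'),  # None for CPU-only partitions
            slurm_params=entry['slurm_params'],
        )


example_map = {
    'virtual_partition_1': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'slurm_params': '-p genoa',
    },
    'virtual_partition_2': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'accel': 'nvidia/cc90',
        'slurm_params': '-p gpu_h100',
    },
}

for target in iter_build_targets(example_map):
    print(target.partition, target.subdir, target.accel, target.slurm_params)
```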

GOAL

Suppose we have

arch_target_map = {
    # zen4 CPU nodes
    'virtual_partition_1': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'slurm_params': '-p genoa <etc>',
    },
    # Zen4 CPU + H100 GPU nodes
    'virtual_partition_2': {
        'os': 'linux',
        'subdir': 'x86_64/amd/zen4',
        'accel': 'nvidia/cc90',
        'slurm_params': '-p gpu_h100 <etc>',
    },
    # Icelake + A100 GPU nodes
    'virtual_partition_3': {
        'os': 'linux',
        'subdir': 'x86_64/intel/icelake',
        'accel': 'nvidia/cc80',
        'slurm_params': '-p gpu_a100 <etc>',
    },
    # Icelake + A100 GPU nodes, pretending to be Icelake CPU only
    'virtual_partition_4': {
        'os': 'linux',
        'subdir': 'x86_64/intel/icelake',
        'slurm_params': '-p gpu_a100 <etc>',
    },
}

Then we have the following four scenarios:

  1. Building for zen4 CPU:
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4

There is no accel defined. Thus, the bot should match this against the arch_target_map and figure out that this has to be built on virtual_partition_1, i.e. using the slurm_params defined there. The build prefix is inferred from the bot build command, and will thus be x86_64/amd/zen4, as intended.

  2. Building for zen4+CC90:
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90

Bot matches this to virtual_partition_2 and uses those slurm_params upon submission. The prefix again will be determined based on the build command, and thus be x86_64/amd/zen4/nvidia/cc90 as intended.

  3. Building for icelake+CC80:
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:icelake accel:nvidia/cc80

Bot matches this to virtual_partition_3 and uses those slurm_params upon submission. The prefix again will be determined based on the build command, and thus be x86_64/intel/icelake/nvidia/cc80 as intended.

  4. Building for icelake:
bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:icelake

Bot matches this to virtual_partition_4 and uses those slurm_params upon submission. The prefix again will be determined based on the build command, and thus be x86_64/intel/icelake as intended.
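The matching described in these scenarios could be sketched as follows. This is only an illustration under stated assumptions: `find_partition` and `expected_prefix` are hypothetical names, the `arch:` value is assumed to match the last component of `subdir`, a missing `accel:` is assumed to match only entries without an `accel` key, and the `<etc>` placeholders are left out of the Slurm parameters.

```python
ARCH_TARGET_MAP = {
    'virtual_partition_1': {'os': 'linux', 'subdir': 'x86_64/amd/zen4',
                            'slurm_params': '-p genoa'},
    'virtual_partition_2': {'os': 'linux', 'subdir': 'x86_64/amd/zen4',
                            'accel': 'nvidia/cc90',
                            'slurm_params': '-p gpu_h100'},
    'virtual_partition_3': {'os': 'linux', 'subdir': 'x86_64/intel/icelake',
                            'accel': 'nvidia/cc80',
                            'slurm_params': '-p gpu_a100'},
    'virtual_partition_4': {'os': 'linux', 'subdir': 'x86_64/intel/icelake',
                            'slurm_params': '-p gpu_a100'},
}


def find_partition(arch_map, arch, accel=None):
    """Return the single partition name matching arch (and accel, if given)."""
    matches = [
        name for name, entry in arch_map.items()
        # Assumption: 'arch:zen4' refers to the last component of 'subdir'.
        if entry['subdir'].rsplit('/', 1)[-1] == arch
        and entry.get('accel') == accel
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one partition for arch={arch!r}, "
            f"accel={accel!r}, found {matches}"
        )
    return matches[0]


def expected_prefix(arch_map, partition, accel=None):
    """Join the matched subdir with the accel requested in the build command."""
    subdir = arch_map[partition]['subdir']
    return f"{subdir}/{accel}" if accel else subdir


# The four scenarios, as (arch, accel) pairs taken from the build commands.
for arch, accel in [('zen4', None), ('zen4', 'nvidia/cc90'),
                    ('icelake', 'nvidia/cc80'), ('icelake', None)]:
    partition = find_partition(ARCH_TARGET_MAP, arch, accel)
    print(partition, expected_prefix(ARCH_TARGET_MAP, partition, accel))
```

Requiring exactly one match keeps ambiguous configurations, such as two CPU-only zen4 entries, from silently picking an arbitrary partition; whether the bot should instead prefer the first match is a design decision left open here.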
