According to the official ROCm documentation (v5.4.3), "ROCm is a brand name for ROCm open software platform (for software) or the ROCm™ open platform ecosystem (includes hardware like FPGAs or other CPU architectures)."
In the scope of Gentoo distribution, "ROCm" refers to ROCm open software platform, currently supporting AMDGPU as its hardware.
ROCm itself aims to be an environment for heterogeneous computing, not limited to AMDGPU. It is the current packaging strategy of Gentoo that ROCm only supports AMDGPU; if ROCm is needed for other vendors (typically the CUDA backend of sci-libs/hip-* packages), please file a bug on the Gentoo Bugzilla.
ROCm is somewhat analogous to CUDA in the sense of providing an API (HIPAMD) whose usage is in the same spirit: simplified kernel-based GPU programming. It also provides OpenCL and OpenMP programming models. Note that ROCm is not the only way to run compute tasks on AMD GPUs: Mesa3D (media-libs/mesa) also provides this capability through its own OpenCL and Vulkan APIs.
Components of ROCm
- Drivers and runtimes, provided by the amdgpu kernel module and dev-libs/rocr-runtime.
- Programming models. See ROCm#Programming_models for details.
- Compilers and tools. Gentoo uses vanilla clang (sys-devel/clang).
- Libraries. Gentoo has packaged most libraries prefixed by roc or hip: sci-libs/roc* packages are written in HIP and use hipamd as backend, while sci-libs/hip* are simple wrappers.
- Deployment tools. As a user of Gentoo, the best choice to deploy common ROCm components is via portage.
Supported AMD GPUs
ROCm targets discrete AMD GPUs and officially supports nearly all GFX9 and later GPU generations, with a few exceptions. Basically this covers GCN 3 (Fiji chips only) to GCN 5 and all RDNA micro-architectures.
Kernel configuration options
The following kernel configuration options have to be enabled:
Device Drivers --->
    Graphics Support --->
        <M> AMD GPU
        [ ] Enable amdgpu support for SI parts
        [ ] Enable amdgpu support for CIK parts
        -*- Always enable userptr write support
            ACP (Audio CoProcessor) Configuration --->
            Display Engine Configuration --->
        [*] HSA kernel driver for AMD GPU devices
        [*] Enable HMM-based shared virtual memory manager
You can also choose to build AMD GPU support directly into the kernel instead of as a module.
Kernel command line parameters
The default option values for the AMDGPU kernel module are adequate and provide a stable configuration. Under some circumstances it is possible to override various options (see drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c in the Linux kernel source tree for a detailed description of those).
For example, setting amdgpu.ppfeaturemask=0xffffffff gives full features in AMDGPU PowerPlay, which may be useful when adjusting GPU power profiles via tools such as dev-util/rocm-smi.
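As a sketch, on systems booted via GRUB this parameter could be set through the standard GRUB_CMDLINE_LINUX variable (the exact bootloader configuration varies between setups):

```
# /etc/default/grub (hypothetical excerpt)
GRUB_CMDLINE_LINUX="amdgpu.ppfeaturemask=0xffffffff"
```

Afterwards regenerate the bootloader configuration, e.g. with grub-mkconfig -o /boot/grub/grub.cfg, and reboot for the parameter to take effect.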
Checking for a functional kernel configuration
Once your Linux kernel has been reconfigured, recompiled in a manner of your choice, and the machine rebooted into it, you can check for KFD messages in the Linux kernel log. If AMDKFD (AMD Fusion Kernel Driver), the HSA compute Linux kernel driver, has been successfully enabled for your AMD GPU, you should see something like:
dmesg | grep kfd
[ 9.324948] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 9.376206] kfd kfd: amdgpu: added device 1002:73df
Using ROCm requires access to the character device /dev/kfd, which is owned by root:video. Thus any user wanting to use ROCm should be added to the video group. This is achieved via:
usermod -a -G video userlogin
This change takes effect at the user's next login.
System monitoring tools
dev-util/rocm-smi provides various information about the underlying AMD GPU devices such as the die temperature, the current clock speeds and VRAM usage. To install it:
emerge --ask dev-util/rocm-smi
You can simply invoke it without any options to see a short summary:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK   Fan  Perf  PwrCap  VRAM%  GPU%
0    45.0c           7.0W    500Mhz  96Mhz  0%   auto  220.0W   16%   1%
================================================================================
============================= End of ROCm SMI Log ==============================
You can invoke rocm-smi with the option -a to view a complete report.
Another package of interest is dev-util/rocminfo which provides detailed information about currently registered HSA devices. To install it:
emerge --ask dev-util/rocminfo
Then, simply invoke it to see what is usable by ROCm:
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 9 7950X 16-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 9 7950X 16-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4500
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            32
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    131631704(0x7d88a58) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131631704(0x7d88a58) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    131631704(0x7d88a58) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1031
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 6750 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      3072(0xc00) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29663(0x73df)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2880
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            40
  SIMDs per CU:            2
  Shader Engines:          4
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    12566528(0xbfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1031
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***
In the above example, the integrated GPU had been deactivated in the BIOS, thus only the CPU itself and the dGPU are usable.
Detailed information can be seen in OpenCL#AMD.
Detailed information can be seen in HIP.
Clang 15 is not functional as of mid-April 2023. See bug #904143.
Enabling OpenMP for Clang/LLVM
To enable OpenMP offloading on AMDGPU, ensure that:
- sys-devel/clang-runtime has been emerged with the openmp USE flag
- sys-libs/libomp (LLVM/clang specific) has been emerged with the USE flags
This is achieved by:
- Checking that the portage configuration has AMDGPU set as a target in the LLVM_TARGETS variable:
(...)
LLVM_TARGETS="AMDGPU BPF X86"
(...)
- Adding the required statements in /etc/portage/package.use, either by adding a file under it (if /etc/portage/package.use is a directory) or an entry (if /etc/portage/package.use is a file), then rebuilding:
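Such an entry might look like the following sketch; the openmp flag on sys-devel/clang-runtime is mentioned in this article, while the offload flag name for sys-libs/libomp is an assumption to verify with equery uses:

```
# /etc/portage/package.use/openmp-offload (hypothetical file name)
sys-devel/clang-runtime openmp
sys-libs/libomp offload
```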
emerge --ask --deep --newuse sys-devel/clang-runtime
You can also set the openmp USE flag globally by adding it to the USE variable defined in /etc/portage/make.conf.
Testing OpenMP support
clang -fno-stack-protector -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa --rocm-path=/opt/gentoo/usr --rocm-device-lib-path=/opt/gentoo/usr/lib/amdgcn/bitcode <openmp source code> -o <executable>
-fno-stack-protector is needed because Gentoo turns on the stack protector by default via /etc/clang/gentoo-hardened.cfg; see llvm-project issue 62066 and bug #890377.
If clang emits:
error: no library 'libomptarget-amdgpu-gfx1031.bc' found in the default clang lib directory or in LIBRARY_PATH; use '--libomptarget-amdgpu-bc-path' to specify amdgpu bitcode library
then adding --libomptarget-amdgpu-bc-path=/usr/lib64/ to the compile command resolves the issue.
If sys-devel/clang does not contain the amdgpu-arch tool, or the target GPU is not present on the machine, clang emits:
error: cannot determine AMDGPU architecture: amdgpu-arch: Execute failed: Executable "amdgpu-arch" doesn't exist!
In this case clang cannot detect the GPU architecture automatically (or, when cross compiling, the target GPU is not present on the compile machine), so clang needs a GPU arch specifier script:
#!/bin/bash
# Change to the target to compile here, but do not append target features such as :xnack-
echo "gfx90a"
Make the script executable:
chmod +x /tmp/print_gpu_arch.sh
And then add the option --amdgpu-arch-tool=/tmp/print_gpu_arch.sh to the compilation command.
The backend of ROCm is currently LLVM/clang, so any programming model that can generate LLVM IR for AMDGPU can use ROCm. Numba is a JIT compiler for Python code and can offload to ROCm; however, Gentoo has not yet packaged Numba with ROCm support.
Currently, Gentoo has packaged ROCm libraries in the sci-libs category. These are math and deep learning libraries written in HIP that run on AMD GPUs.
Wrapper packages include sci-libs/hipBLAS (wrapper of sci-libs/rocBLAS) and sci-libs/hipCUB (wrapper of sci-libs/rocPRIM). hipDNN is currently not packaged; it is a wrapper of MIOpen.
sci-libs/rccl (targeting nccl) provides collective communication routines for AMD GPUs. It can also run tests, but the tests are only meaningful on multi-GPU systems.
sci-libs/rocALUTION (targeting paralution) is currently in development.
Specifying architectures to compile
With rocm.eclass (ROCm version >=5.1.3), Gentoo handles GPU architecture selection via the AMDGPU_TARGETS USE_EXPAND. The map between GPU and arch name can be viewed by checking the USE flags of ROCm libraries:
equery uses rocBLAS
 * Found these USE flags for sci-libs/rocBLAS-5.4.2-r1:
 U I
 - - amdgpu_targets_gfx1010 : RDNA GPU, codename navi10, including Radeon RX 5700XT/5700/5700M/5700B/5700XTB/5600XT/5600/5600M, Radeon Pro 5700XT/5700, Radeon Pro W5700X/W5700
 - - amdgpu_targets_gfx1011 : RDNA GPU, codename navi12, including Radeon Pro 5600M/V520
 - - amdgpu_targets_gfx1012 : RDNA GPU, codename navi14, including Radeon RX 5500XT/5500/5500M/5500XTB/5300/5300M, Radeon Pro 5500XT/5500M/5300/5300M, Radeon Pro W5500X/W5500/W5500M/W5300M
 + - amdgpu_targets_gfx1030 : RDNA2 GPU, codename navi21/sienna cichlid, including Radeon RX 6950XT/6900XT/6800XT/6800, Radeon Pro W6800
 - - amdgpu_targets_gfx1031 : RDNA2 GPU, codename navi22/navy flounder, including Radeon RX 6750XT/6700XT/6800M/6700M
 - - amdgpu_targets_gfx1100 : RDNA3 GPU, codename navi31/plum bonito, including Radeon RX 7900XTX/7900XT
 - - amdgpu_targets_gfx1101 : RDNA3 GPU, codename navi32
 - - amdgpu_targets_gfx1102 : RDNA3 GPU, codename navi33
 - - amdgpu_targets_gfx803 : Fiji GPU, codename fiji, including Radeon R9 Nano/Fury/FuryX, Radeon Pro Duo, FirePro S9300x2, Radeon Instinct MI8
 - - amdgpu_targets_gfx900 : Vega GPU, codename vega10, including Radeon Vega Frontier Edition, Radeon RX Vega 56/64, Radeon RX Vega 64 Liquid, Radeon Pro Vega 48/56/64/64X, Radeon Pro WX 8200/9100, Radeon Pro V320/V340/SSG, Radeon Instinct MI25
 + - amdgpu_targets_gfx906 : Vega GPU, codename vega20, including Radeon (Pro) VII, Radeon Instinct MI50/MI60
 + - amdgpu_targets_gfx908 : CDNA Accelerator, codename arcturus, including AMD Instinct MI100 Accelerator
 + - amdgpu_targets_gfx90a : CDNA2 Accelerator, codename aldebaran, including AMD Instinct MI200 series Accelerators
 - - benchmark : Build and install rocblas-bench.
 - - doc : Add extra documentation (API, Javadoc, etc). It is recommended to enable per package instead of globally
 - - test : Perform rocblas-test to compare the result between rocBLAS and system BLAS.
By default, the officially supported architectures (gfx906 gfx908 gfx90a gfx1030) are turned on. For example, for a system with a Radeon VII and an RX 6700XT, specify the GPU archs for all packages:
# disable gfx908, gfx90a, gfx1030; turn on gfx1031; gfx906 remains on
*/* AMDGPU_TARGETS: -gfx908 -gfx90a -gfx1030 gfx1031
Adjusting USE flags for individual packages is also supported, and Portage will take care of the dependencies: for example, if a package depending on sci-libs/rocBLAS turns on gfx1031, Portage will require sci-libs/rocBLAS to enable gfx1031 as well.
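As a sketch, a per-package entry in /etc/portage/package.use could look like this (the choice of gfx1031 here is illustrative):

```
# enable the navi22 target for rocBLAS only
sci-libs/rocBLAS amdgpu_targets_gfx1031
```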
Upgrading to 5.1.3 or above from the legacy way
Before rocm.eclass (ROCm version <5.1.3), architectures were specified via the AMDGPU_TARGETS environment variable. For users who installed ROCm libraries the legacy way (specifying AMDGPU_TARGETS in /etc/portage/make.conf), upgrading to 5.1.3 takes two steps: remove the variable from /etc/portage/make.conf, then add the /etc/portage/package.use/00-amdgpu-targets entry mentioned in ROCm#Specifying_architectures_to_compile.
XNACK target feature
NACK stands for negative acknowledgement. It is an important feature in Heterogeneous Memory Management (HMM). Simply put, in HMM the GPU VRAM and CPU RAM share the same pointer space, while XNACK allows the GPU to retry memory accesses after a page fault when the desired data is not resident in GPU VRAM, making VRAM effectively a cache of CPU RAM for GPU devices. Some documents may help in better understanding the mechanism and importance of XNACK, how to enable it, and its performance impact:
For GFX9-series GPUs, it is common to append the XNACK feature flag to the GPU architecture name, for example gfx90a:xnack-. xnack- means XNACK is disabled, while xnack+ means it is enabled. This feature can be checked with rocminfo.
Contributing and developing guide
Testing ROCm libraries is not easy: it requires recent AMD discrete GPUs and days of compilation and testing. If you use ROCm libraries and mathematical correctness is important, please test on your hardware by enabling tests:
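One common way to enable tests for a single package is Portage's package.env mechanism, sketched below (the file names are illustrative):

```
# /etc/portage/env/test.conf
FEATURES="test"

# /etc/portage/package.env
sci-libs/rocBLAS test.conf
```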
Then emerge the desired ROCm package. If test failures occur, they are usually caused by small numerical inconsistencies between the ROCm libraries and the CPU reference implementations, by upstream bugs, or by the Gentoo deployment strategy. In any of these situations, filing a bug report to the Gentoo Bugzilla is welcome, and mathematical errors are better also reported upstream.