PAPI

Design and Evaluation of a Standardized Interface for AMD GPU Performance Metrics

This work delivered a portable and low-overhead AMD GPU/APU monitoring interface for PAPI, designed to remain stable across device generations and ROCm software transitions.

Key Contributions

  • Portable, low-overhead, probe-based interface design for AMD GPUs: introduces a portable interface that abstracts vendor-specific API differences, keeps per-call overhead low, and uses probe-based runtime discovery across GPU generations and software releases.
  • Empirical validation of overhead, portability, and monitoring interval: establishes near-native per-call overhead relative to direct AMD SMI calls, confirms consistent access across evaluated GPU generations and ROCm releases, and identifies practical polling intervals.
  • Platform-scoped reference of supported metrics and limits: documents the supported metrics on the evaluated platform to enable reproducible and comparable studies.
  • Integration with PAPI: provides seamless incorporation into the PAPI framework, extending support to current AMD GPU/APUs while preserving continuity for existing users and tools.

Thesis

Objective: Modern high-performance computing systems increasingly rely on GPU accelerators, but differences and updates in vendor APIs make cross-platform measurement difficult and force users to maintain vendor-specific code. This thesis targets a robust and vendor-neutral AMD GPU/APU monitoring interface for performance analysis, energy optimization, and reliability studies.

Design: The implementation is built on AMD SMI and integrated into the PAPI component model. It abstracts vendor-specific API differences while maintaining low overhead, automatically detects which metrics a specific device and driver support, and exposes only those. The exposed metrics cover power/energy, temperature, engine activity, and interconnect bandwidth/link state.

Validation: Experimental results confirm near-native measurement overhead and consistent behavior across AMD GPU models and ROCm versions, reducing maintenance burden and improving reproducibility for cross-platform performance analysis in heterogeneous HPC environments.

Key thesis results

Measured overhead and validation scale

1.009x and 32 metrics x 500 reads

Geometric mean call-time ratio (standardized interface / direct AMD SMI) using TOST with +/-2% bounds. Per-metric latency analysis covered fast, medium, and slower call categories.

Portability validation

MI210, MI250X, MI300A

Built, enumerated, and read supported metrics across ROCm 6.4.x, 7.0.x, 7.1.x, and 7.2.x.

Coverage expansion

342 vs 80 metrics

On MI300A with ROCm 7.0.1, AMD SMI exposed 342 device-unique metrics, compared with 80 for ROCm SMI.

Integration with PAPI

PAPI Integration

Seamless incorporation into the PAPI framework, extending functionality to current AMD GPU/APUs and ensuring continuity for existing users and tools.

AMD SMI Architecture

The design implements a portable, low-overhead abstraction layer over AMD SMI so that AMD GPU/APU metrics can be enumerated and sampled consistently through PAPI across hardware and ROCm versions.

  • Key design choices: probe-based runtime discovery validates metric availability before exposure, dynamic library loading with primary/fallback symbol binding handles API divergence, and bounded accessors keep the read path safe and low-latency.
  • Runtime event registration: at initialization, the component builds an event table from device probes so unsupported counters are never exposed for enumeration or sampling.
  • Concurrency and ownership: per-device ownership is enforced using a lock-protected device mask so concurrent eventsets in the same process do not sample the same GPU simultaneously.
  • Architecture and lifecycle: initialization loads and validates AMD SMI symbols, discovers devices, and builds events; measurement opens context, acquires device ownership, and executes accessor reads; shutdown releases context resources and unloads library state.
  • Module organization: core lifecycle state in amds.c, event mapping in amds_evtapi.c, context and locking in amds_ctx.c, metric read accessors in amds_accessors.c, and PAPI vector integration in linux-amd-smi.c.
PAPI AMD SMI architecture diagram from thesis page 28.
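
Two of the design choices listed above, primary/fallback symbol binding and the lock-protected device mask, can be illustrated with a minimal sketch. The component itself is written in C; this Python version only mirrors the logic. The names `bind_symbol` and `DeviceOwnership` are hypothetical and do not come from the thesis code.

```python
import threading


def bind_symbol(lib, primary, fallback):
    """Return (name, callable) for the first of two symbol names that lib exports.

    `lib` can be any object with attribute lookup, for example a ctypes
    library handle, where a missing symbol raises AttributeError.
    """
    for name in (primary, fallback):
        fn = getattr(lib, name, None)
        if fn is not None:
            return name, fn
    raise LookupError(f"neither {primary!r} nor {fallback!r} is available")


class DeviceOwnership:
    """Lock-protected bitmask: bit i is set while device i is owned."""

    def __init__(self):
        self._lock = threading.Lock()
        self._mask = 0

    def acquire(self, device_ids):
        """Claim all requested devices atomically, or claim none of them."""
        wanted = 0
        for dev in device_ids:
            wanted |= 1 << dev
        with self._lock:
            if self._mask & wanted:
                return False  # at least one device is already owned
            self._mask |= wanted
            return True

    def release(self, device_ids):
        """Clear the bits for devices previously claimed by the caller."""
        with self._lock:
            for dev in device_ids:
                self._mask &= ~(1 << dev)
```

In this sketch a second eventset that requests an already-owned GPU is rejected rather than blocked. Whether the real component rejects or waits is not stated here, so treat that behavior as an assumption.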

Global Equivalence Experiment

This experiment compares per-call read latency between the standardized interface and direct AMD SMI calls on MI300A with ROCm 7.0.1. For each metric and source, the measurement loop runs on a pinned CPU core and records 500 iterations.

Equivalence is evaluated with Two One-Sided Tests (TOST) using +/-2% bounds on log call-time ratios. The geometric-mean ratio is 1.009 with a 98% confidence interval of [0.999, 1.019], indicating near-native overhead for the standardized path.

Call-time comparison (butterfly chart) of the standardized interface versus direct AMD SMI across metrics. With 500 iterations, the mirrored distributions remain nearly symmetric across fast and slower calls.
Quantity                            Estimate    98% CI            Interpretation
Geometric-mean ratio (SI/AMD SMI)   1.009       [0.999, 1.019]    Within +/-2% bounds
Cross-metric mean (AMD SMI)         371.70 us   ---               Absolute reference
Cross-metric mean (SI)              373.64 us   ---               +0.52% vs SMI
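
The equivalence test described above can be sketched as follows. This is a minimal illustration, not the thesis procedure. It assumes one mean call time per metric for each path, places the bounds at log(0.98) and log(1.02), and substitutes a large-sample normal approximation for whatever reference distribution the thesis used. The function name `tost_log_ratio` is hypothetical. The choice alpha = 0.01 is inferred from the reported 98% interval, since TOST at level alpha corresponds to a (1 - 2*alpha) confidence interval.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_log_ratio(si_us, smi_us, bound=0.02, alpha=0.01):
    """Two One-Sided Tests on per-metric log call-time ratios (normal approx.)."""
    logs = [math.log(a / b) for a, b in zip(si_us, smi_us)]
    m = mean(logs)
    se = stdev(logs) / math.sqrt(len(logs))
    lo, hi = math.log(1.0 - bound), math.log(1.0 + bound)

    z = NormalDist()
    p_lower = 1.0 - z.cdf((m - lo) / se)  # H0: true mean log-ratio <= lo
    p_upper = z.cdf((m - hi) / se)        # H0: true mean log-ratio >= hi
    p = max(p_lower, p_upper)             # both one-sided tests must reject

    zcrit = z.inv_cdf(1.0 - alpha)        # gives a (1 - 2*alpha) interval
    ci = (math.exp(m - zcrit * se), math.exp(m + zcrit * se))
    return {
        "ratio": math.exp(m),             # geometric-mean ratio
        "ci": ci,
        "p": p,
        "equivalent": p < alpha,
    }
```

Equivalence is declared only when the whole interval lies inside the bounds, which is the situation reported above for [0.999, 1.019] against [0.98, 1.02].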

Portability coverage by device and ROCm version

The evaluation verifies that the interface design and implementation handle differences in available metrics across GPU generations and ROCm releases by compiling, enumerating, and reading metrics on 2 MI210, 8 MI250X, and 4 MI300A systems under ROCm 6.4.x, 7.0.x, 7.1.x, and 7.2.x.

Counts include only the metrics supported by the specific device and ROCm release. Probe-based runtime discovery provides this portability: a metric that is unsupported on a given device or ROCm version is never exposed for enumeration or sampling.

Device      ROCm 6.4.x                      ROCm 7.0.x      ROCm 7.1.x  ROCm 7.2.x
            6.4.0   6.4.1   6.4.2   6.4.3   7.0.1   7.0.2   7.1.1       7.2.0
4 MI300A    259     259     259     259     342     341     342         342
8 MI250X    N/A     375     N/A     N/A     N/A     356     N/A         N/A
2 MI210     345     N/A     N/A     N/A     333     N/A     333         N/A
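
The probe-based discovery that produces these per-platform counts can be sketched as below. This is an illustration under stated assumptions: each candidate metric has a probe callable that either returns a value or raises when the device or driver does not support it. The function `build_event_table` and the `name:device=N` event naming are hypothetical, not the component's actual identifiers.

```python
def build_event_table(device_ids, probes):
    """Register an event only if its probe succeeds on that device.

    probes: dict mapping a metric name to a callable(device_id) that
    returns a reading, or raises if the metric is unsupported.
    """
    events = []
    for dev in device_ids:
        for name, probe in probes.items():
            try:
                probe(dev)  # one trial read at initialization
            except Exception:
                continue    # unsupported here: never exposed
            events.append(f"{name}:device={dev}")
    return events
```

Because the table is built from what the probes report at run time, the same code yields different event counts on different devices and ROCm releases without version-specific branches.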

GEMM workload traces

After portability and metric enumeration checks, the thesis evaluates metric behavior under controlled GEMM workloads on MI300A, MI210, and MI250X systems. Sampled signals include GFX activity, UMC activity, power, edge/junction/PLX temperatures, and CU occupancy (where supported).

Across devices, traces preserve the expected temporal ordering of activity -> power -> temperature. The rocBLAS GEMM configuration (K = 131072) shows higher memory traffic and power than the reference GEMM configuration (K = 65536), consistent with the thesis analysis.

MI300A reference GEMM metrics trace (M=14592, K=65536, N=14592).
MI300A rocBLAS GEMM metrics trace (M=14592, K=131072, N=14592).
MI210 reference GEMM metrics trace (M=14592, K=65536, N=14592).
MI210 rocBLAS GEMM metrics trace (M=14592, K=131072, N=14592).
MI250X reference GEMM metrics trace (M=14592, K=65536, N=14592).
MI250X rocBLAS GEMM metrics trace (M=14592, K=131072, N=14592).

Monitoring Interval Experiment

This experiment examines how the requested monitoring interval affects overhead and observable updates for power_current on MI300A (ROCm 7.0.1). Sampling intervals from 1 ms to 100 ms are tested during a 100 s run with repeated rest/load phases.

The table reports completed reads, per-call wall time, total time spent in reads, effective interval, and value changes. Results show the expected tradeoff: 1-2 ms intervals yield more distinct updates but higher sampling overhead, while longer intervals reduce overhead and capture fewer changes.
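
A fixed-interval sampling loop that collects the quantities named above might look like the following sketch. It is not the thesis harness. The function `sample_at_interval`, its parameters, and the definition of effective interval as the mean spacing between consecutive read starts are all assumptions. The `clock` and `sleep` arguments are injectable so the loop can be exercised without real hardware.

```python
import time


def sample_at_interval(read, interval_s, duration_s,
                       clock=time.perf_counter, sleep=time.sleep):
    """Call read() once per interval and summarize cost and value changes."""
    total = int(round(duration_s / interval_s))  # number of scheduled reads
    start = clock()
    starts = []
    time_in_reads = 0.0
    changes = 0
    last = None
    for k in range(total):
        deadline = start + k * interval_s  # absolute deadlines avoid drift
        now = clock()
        if now < deadline:
            sleep(deadline - now)
        t0 = clock()
        value = read()
        time_in_reads += clock() - t0
        starts.append(t0)
        if k > 0 and value != last:
            changes += 1                   # reading differs from previous one
        last = value
    n = len(starts)
    nan = float("nan")
    return {
        "reads": n,
        "mean_call_us": time_in_reads / n * 1e6 if n else nan,
        "time_in_reads_s": time_in_reads,
        "effective_interval_ms": (starts[-1] - starts[0]) / (n - 1) * 1e3 if n > 1 else nan,
        "changes": changes,
    }
```

If a read takes longer than the interval, the loop in this sketch skips the sleep and falls behind schedule, which would show up as an effective interval larger than the requested one.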

Interval (ms)   Reads    Mean +/- s.d. (us)   Time in reads (s)   Eff. interval (ms)   Changes (count)
1               100000   589.33 +/- 193.69    58.93               1.00                 6753
2               50000    785.27 +/- 176.53    39.26               2.00                 5854
5               20000    792.63 +/- 186.40    15.85               5.00                 3644
10              10000    799.71 +/- 183.06    8.00                10.00                2360
20              5000     798.92 +/- 177.15    3.99                20.00                1474
50              2000     841.12 +/- 167.66    1.68                50.00                830
100             1000     854.48 +/- 168.04    0.85                100.00               457
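
The columns of this table are related by simple arithmetic, which the short check below makes explicit. The values are copied from the table. The helper names are hypothetical, and expressing read time as a share of the 100 s run is a derived quantity that the table itself does not report.

```python
# (interval_ms, reads, mean_us, reported_time_in_reads_s), copied from the table
ROWS = [
    (1, 100000, 589.33, 58.93),
    (2, 50000, 785.27, 39.26),
    (5, 20000, 792.63, 15.85),
    (10, 10000, 799.71, 8.00),
    (20, 5000, 798.92, 3.99),
    (50, 2000, 841.12, 1.68),
    (100, 1000, 854.48, 0.85),
]
RUN_SECONDS = 100.0  # run length stated in the experiment description


def time_in_reads_s(reads, mean_us):
    """Total time spent inside read calls: reads x mean per-call time."""
    return reads * mean_us * 1e-6


def read_share(reads, mean_us, run_s=RUN_SECONDS):
    """Fraction of the run spent inside read calls."""
    return time_in_reads_s(reads, mean_us) / run_s


if __name__ == "__main__":
    for interval_ms, reads, mean_us, reported_s in ROWS:
        share = read_share(reads, mean_us)
        print(f"{interval_ms:>4} ms: {time_in_reads_s(reads, mean_us):6.2f} s "
              f"in reads ({share:.1%} of the run)")
```

At a 1 ms interval the reads alone account for about 59% of the 100 s run, while at 100 ms they account for under 1%, which quantifies the tradeoff described above.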
