Argonne Leadership Computing Facility |
an Office of Science user facility |
ALCF Public Data |
Data Set: GEOPM |
Source: Argonne Leadership Computing Facility |
This data was generated using the Global Extensible Open Power Manager (GEOPM). The software and its documentation are available from the GEOPM project; see:
github geopm
github geopm documentation
Trace files |
Each file contains time series data from a single node. |
File name format: dgemm_[agent]_[power_cap]_[iteration]_trace_[node_name] |
agent: "power_governor" or "power_balancer" |
power_cap: 125W to 215W |
iteration: experiment iteration |
node_name: supercomputer node identifier |
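The file name format above can be split into its fields with a regular expression. The following is a minimal sketch, not part of the data set tooling. It assumes that the power cap appears in the name as an integer with an optional trailing `W`, that the iteration field contains no underscore, and that the node name is everything after `_trace_`. The helper `parse_trace_name` and the sample file name in the usage note are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern for dgemm_[agent]_[power_cap]_[iteration]_trace_[node_name].
# The agent names contain underscores, so they are matched literally.
# Assumption: power_cap is digits with an optional "W" suffix.
TRACE_NAME = re.compile(
    r"^dgemm_(?P<agent>power_governor|power_balancer)"
    r"_(?P<power_cap>\d+)W?"
    r"_(?P<iteration>[^_]+)"
    r"_trace_(?P<node_name>.+)$"
)


class TraceName(NamedTuple):
    agent: str
    power_cap_watts: int
    iteration: str
    node_name: str


def parse_trace_name(file_name: str) -> Optional[TraceName]:
    """Return the fields of a trace file name, or None if it does not match."""
    match = TRACE_NAME.match(file_name)
    if match is None:
        return None
    return TraceName(
        agent=match["agent"],
        power_cap_watts=int(match["power_cap"]),
        iteration=match["iteration"],
        node_name=match["node_name"],
    )
```

For example, `parse_trace_name("dgemm_power_balancer_150_2_trace_nid00042")` (a made-up name) yields the agent `power_balancer`, a 150 W cap, iteration `"2"`, and node `nid00042`. Names that do not follow the format return `None`, which makes it easy to skip unrelated files when scanning a directory.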
Start Time | The time that the GEOPM runtime started |
Profile | The power cap in watts used for the experiment. In general, this field is configurable by the user. |
Agent | The GEOPM agent used for the experiment. "power_governor" applies a fixed power cap to all nodes. "power_balancer" enforces an average power cap while improving performance by applying different power caps to each node. |
runtime | Average runtime in seconds across all ranks that entered the region. |
sync-runtime | Total time in seconds that all ranks were in the region at the same time. |
package-energy | Total energy in joules used by the processor package when all ranks were in the region. |
dram-energy | Total energy in joules used by the DRAM when all ranks were in the region. |
frequency (%) | Average processor frequency as a percentage of the sticker frequency (1.3 GHz on Theta). |
frequency (Hz) | Average processor frequency in hertz. |
mpi-runtime | Total time spent in MPI calls within this region. |
count | The maximum number of times any rank entered this region. |
runtime | time interval in seconds from the beginning to the end of the run. |
package-energy | total energy in joules used by the processor package during the run. |
dram-energy | total energy in joules used by the DRAM during the run. |
mpi-runtime | total time spent in MPI calls. |
ignore-time | total time spent in regions marked "ignore". There are none in these experiments. |
geopmctl memory HWM | maximum memory footprint of the GEOPM process on the node. |
geopmctl network BW (B/sec) | maximum network bandwidth used by the GEOPM process on the node. |
geopm_version | describes the version of GEOPM used to generate the data. |
start_time | the time that the GEOPM runtime started. |
profile_name | the power cap in watts used for the experiment. In general, this field is configurable by the user. |
node_name | the node for which the data was collected. |
agent | the GEOPM agent used for the experiment. |
time | time in seconds since the application started. Note that multiple samples may be inserted at the same time to include samples reported by the application for region entry and exit. |
region_id | numerical region ID of the entire board for a given sample. The name of the region can be determined by matching the number with the region in the report. |
region_progress | progress from 0.0 to 1.0 within a region. A value of 1.0 indicates the application reported an exit from the region. |
region_runtime | last measured runtime for the current region. |
energy_package | sum of energy for all packages as reported by the MSR_PKG_ENERGY_STATUS register. |
energy_dram | sum of energy for all DRAM as reported by the MSR_DRAM_ENERGY_STATUS register. |
power_package | average package (socket) power over the last 8 samples within the current region. |
power_dram | average DRAM power over the last 8 samples within the current region. |
frequency | average of instantaneous processor frequency across all cores, as reported by the IA32_PERF_STATUS register. |
cycles_thread | value of CPU_CLK_Unhalted.Core (IA32_FIXED_CTR1). This can be used to calculate the actual achieved frequency by dividing the change in cycles_thread over a time period by the change in cycles_reference over the same time period. The achieved frequency in the report is calculated using this method. |
cycles_reference | value of CPU_CLK_Unhalted.Ref (IA32_FIXED_CTR2). |
temperature_core | average temperature across all cores for the node. |
power_budget | the power budget received by this node from its parent. After receiving the initial budget, it stays the same for the whole run. |
epoch_runtime | the time interval in seconds between the last two epoch (outer loop) calls by the application averaged over all ranks on the node and excluding time spent in MPI. |
power_limit | power limit assigned to the compute node. |
policy_power_cap | the latest power cap received. This is 0 unless a new power cap is being received from the root agent. |
policy_step_count | the current value of the algorithm step counter. The current state is the step count modulo 3. |
policy_max_epoch_runtime | the maximum runtime across all nodes. |
policy_power_slack | the latest power slack value received from the parent. |
policy_power_limit | the actual power limit that was set on the node. It may be different from "power_limit" due to hardware constraints. |
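Several of the trace columns above are meant to be combined. The sketch below illustrates three such derivations on plain Python sequences of column values: the achieved frequency from `cycles_thread` and `cycles_reference`, an average power from `energy_package` (or `energy_dram`) and `time`, and the algorithm state from `policy_step_count`. It is an illustration under stated assumptions, not GEOPM's own implementation. It assumes that the reference counter ticks at the sticker frequency (1.3 GHz on Theta), so the cycle ratio is scaled by that value to obtain hertz, and it assumes that "average over the last 8 samples" can be approximated as the energy difference divided by the time difference across that window. All function names are hypothetical.

```python
from typing import Sequence

# Sticker frequency on Theta, as quoted in the report field description.
STICKER_FREQUENCY_HZ = 1.3e9


def achieved_frequency_hz(
    cycles_thread: Sequence[float],
    cycles_reference: Sequence[float],
    sticker_hz: float = STICKER_FREQUENCY_HZ,
) -> float:
    """Change in cycles_thread divided by change in cycles_reference.

    Assumption: the reference counter runs at the sticker frequency,
    so the ratio is multiplied by sticker_hz to express it in hertz.
    """
    delta_thread = cycles_thread[-1] - cycles_thread[0]
    delta_reference = cycles_reference[-1] - cycles_reference[0]
    if delta_reference <= 0:
        raise ValueError("cycles_reference did not advance over the interval")
    return sticker_hz * delta_thread / delta_reference


def average_power_watts(
    time_s: Sequence[float],
    energy_j: Sequence[float],
    window: int = 8,
) -> float:
    """Energy difference over time difference across the last `window` samples.

    Assumption: this is one plausible reading of the power_package and
    power_dram descriptions, not necessarily the exact GEOPM calculation.
    """
    times = time_s[-window:]
    energies = energy_j[-window:]
    elapsed = times[-1] - times[0]
    if elapsed <= 0:
        # Samples can share a time stamp (region entry and exit).
        raise ValueError("time did not advance over the window")
    return (energies[-1] - energies[0]) / elapsed


def balancer_state(policy_step_count: int) -> int:
    """Current algorithm state: the step count modulo 3."""
    return policy_step_count % 3
```

Because multiple samples may carry the same time stamp, the power helper raises an error rather than dividing by zero; a caller can widen the window or drop duplicate time stamps first.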