Date: Wed, 15 Jan 2025 22:43:33 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-2-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
X-Mailer: git-send-email 2.48.0.rc2.279.g1de40edade-goog
Subject: [PATCH v2 01/23] perf vendor events: Update Alderlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
	linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
	Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v1.27 to v1.28. Update TMA metrics from 4.8 to 5.01.

Bring in the v1.28 event updates:
https://github.com/intel/perfmon/commit/801f43f22ec6bd23fbb5d18860f395d61e7f4081

The TMA 5.01 update (with subsequent fixes) is from:
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-authored-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
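A quick way to exercise the refreshed metrics after building perf from
this tree (an Alderlake hybrid system is assumed; "TopdownL1" and
"tma_backend_bound" are a metric group and a metric named in the diff
below):

  $ perf list metric                              # updated metrics appear here
  $ perf stat -M TopdownL1 -a -- sleep 1          # level-1 topdown breakdown
  $ perf stat -M tma_backend_bound -a -- sleep 1  # one of the updated metrics
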
"BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", - "MetricExpr": "tma_core_bound", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (5 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_mi= crocode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend due to backend stalls", + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL@ / (5 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", + "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "(5 * cpu_atom@CPU_CLK_UNHALTED.CORE@ - (cpu_atom@TO= PDOWN_FE_BOUND.ALL@ + cpu_atom@TOPDOWN_BE_BOUND.ALL@ + cpu_atom@TOPDOWN_RET= IRING.ALL@)) / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + t= ma_retiring), 0)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. 
Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%", "Unit": "cpu_atom" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;Memo= ryTLB;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricGroup": "BvML;Default;Mem;MemoryLat;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_gro= up;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (t= ma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_res= teers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispr= edicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_bra= nches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_= ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / = (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;Top= downL1;tma_L1_group", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@U= OPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + = tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma= _other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + = tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itl= b_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switch= es) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms= )) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / t= ma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RES= OURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_s= erializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tm= a_L1_group;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound += tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_= streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Default;Mem;MemoryTLB;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;Default;LockCont;Mem;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slot= s;TopdownL1;tma_L1_group;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Default;Offcore;Scaled_Slots;TopdownL1;tm= a_L1_group", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (5 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to branch mispredicts", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (5 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -170,456 +321,1919 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (5 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS).", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (5 * cpu_atom@CPU= _CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_overlap", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. 
Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles due to backend bo= und stalls that are bounded by core restrictions and not attributed to an o= utstanding load or stores, or resource limitation", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (5 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c01_wait", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (5 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c02_wait", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (5 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", - "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. 
Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls.", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL@ / (5 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", - "MetricGroup": "Default;TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL1;Default", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_grou= p;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (5 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", - "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (5 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered 
by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (5 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", - "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "Ifetch", - "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD@ / cpu_atom@CP= U_CLK_UNHALTED.CORE@", - "MetricGroup": "Load_Store_Miss", - "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. 
See Info.Load_Miss_Bound", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", - "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", - "MetricGroup": "Mem_Exec", - "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((28 * tma_info_system_core_frequency - 3 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_= DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (27 * tma_info_system_core_frequ= ency - 3 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", - "MetricName": "tma_info_br_inst_mix_ipbranch", + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. 
Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instruction per (near) call (lower number mea= ns higher occurrence rate)", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.CALL@", - "MetricName": "tma_info_br_inst_mix_ipcall", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "(27 * tma_info_system_core_frequency - 3 * tma_info= _system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L= 3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_= L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", - "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (5 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", + "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - = cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks = / 2", + "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL4;tma_L4_g= roup;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_= group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIVIDER_ACTIVE", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", - "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired return Branch Mispre= diction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", - "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info= _core_core_clks / 2", + "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired Branch Misprediction= ", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", - "MetricName": "tma_info_br_inst_mix_ipmispredict", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma= _fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio of all branches which mispredict", - "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\= \=3D0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY = - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", - "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (5 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Cycles Per Instruction", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", - "MetricName": "tma_info_core_cpi", + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions Per Cycle", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", - "MetricName": "tma_info_core_ipc", + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_grou= p;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", +        "ScaleUnit": "100%", "Unit": "cpu_atom" }, { -        "BriefDescription": "Uops Per Instruction", -        "MetricExpr": "cpu_atom@UOPS_RETIRED.ALL@ / cpu_atom@INST_RETIRED.= ANY@", -        "MetricName": "tma_info_core_upi", +        "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that are decoded into two or more = uops", +        "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", +        "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_= group;tma_issueD0", +        "MetricName": "tma_few_uops_instructions", +        "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", +        "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that are decoded into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", +        "ScaleUnit": "100%", "Unit": "cpu_atom" }, { -        "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", -        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS.IFETCH@", -        "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", +        "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", +        "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", +        "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operatio= ns_group", +        "MetricName": "tma_fp_arith", +        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", +        "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", +        "ScaleUnit": "100%", "Unit": "cpu_atom" }, { -        "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", -        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS.IFETCH@", -        "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", +        "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handling Floating Point (FP) Assists", +        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", +        "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", +        "MetricName": "tma_fp_assists", +        "MetricThreshold": "tma_fp_assists > 0.1", +        "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handling Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_DRAM_HIT@ / = cpu_atom@MEM_BOUND_STALLS.IFETCH@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS.LOAD@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ / cpu= _atom@MEM_BOUND_STALLS.LOAD@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit= ", + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_DRAM_HIT@ / cp= u_atom@MEM_BOUND_STALLS.LOAD@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s", + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", - "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_l1_bound", + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", - "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS.LOAD@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_load_bound", + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", + "MetricGroup": "Default;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", - "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_store_bound", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", +        "ScaleUnit": "100%", "Unit": "cpu_atom" }, { -        "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory disambiguation", -        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu= _atom@INST_RETIRED.ANY@", -        "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_= pki", +        "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations, instructions that require = two or more uops or micro-coded sequences", +        "DefaultMetricgroupName": "TopdownL2", +        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", +        "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", +        "MetricName": "tma_heavy_operations", +        "MetricThreshold": "tma_heavy_operations > 0.1", +        "MetricgroupNoGroup": "TopdownL2;Default", +        "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations, instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences. ([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", +        "ScaleUnit": "100%", "Unit": "cpu_atom" }, { -        "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", -        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", -        "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", +        "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", +        "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", +        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", +        "MetricName": "tma_icache_misses", +        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", +        "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory ordering", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cp= u_atom@INST_RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_= pki", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (5 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_bandwidth", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory renaming", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@= INST_RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki= ", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (5 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_latency", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to page faults", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_ato= m@INST_RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fa= ult_pki", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to self-modifying code", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_= RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki= ", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", - "MetricExpr": "100 * cpu_atom@LD_BLOCKS.4K_ALIAS@ / cpu_atom@MEM_U= OPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", - "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@M= EM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", - "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_= HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks", + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory 
Execution Bound due to a= pagewalk", - "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD= _HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts;Metric", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", - "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom= @LD_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit", + "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization = if tma_core_bound < tma_ports_utilization else 1) if tma_info_system_smt_2t= _utilization > 0.5 else 0)", + "MetricGroup": "Cor;Metric;SMT", + "MetricName": "tma_info_botlnk_l0_core_bound_likely", + "MetricThreshold": "tma_info_botlnk_l0_core_bound_likely > 0.5", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", - "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding= ", + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Load", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_ipload", + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Store", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", - "MetricName": "tma_info_mem_mix_ipstore", + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", - "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_load_locks_ratio", + "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", - "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_load_splits_ratio", + "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Ifetch", + "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", + "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio of mem load uops to all uops", - "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@UOPS_RETIRED.ALL@", - "MetricName": "tma_info_mem_mix_memload_ratio", + "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD@ / cpu_atom@CP= U_CLK_UNHALTED.CORE@", + "MetricGroup": "Load_Store_Miss", + "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. 
See Info.Load_Miss_Bound", "Unit": "cpu_atom" }, { -        "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", -        "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (5 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", -        "MetricName": "tma_info_serialization _%_tpause_cycles", +        "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", +        "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", +        "MetricGroup": "Mem_Exec", +        "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", +        "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound", "Unit": "cpu_atom" }, { -        "BriefDescription": "Average CPU Utilization", -        "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.REF_TSC@ / TSC", -        "MetricName": "tma_info_system_cpu_utilization", +        "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", +        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", +        "MetricName": "tma_info_br_inst_mix_ipbranch", "Unit": "cpu_atom" }, { -        "BriefDescription": "Fraction of cycles spent in Kernel mode", -        "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@k / cpu_atom@CPU_C= LK_UNHALTED.CORE@", -        "MetricGroup": "Summary", -        "MetricName": "tma_info_system_kernel_utilization", +        "BriefDescription": "Instructions per (near) call (lower number mea= ns higher occurrence rate)", +        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.CALL@", +        "MetricName": "tma_info_br_inst_mix_ipcall", +        "Unit": "cpu_atom" }, { -        "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", -        "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@CPU_CLK_= UNHALTED.REF_TSC@", -        "MetricGroup": "Power", -        "MetricName": "tma_info_system_turbo_utilization", +        "BriefDescription": "Instructions per Far Branch (Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", +        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", +        "MetricName": "tma_info_br_inst_mix_ipfarbranch", +        "Unit": "cpu_atom" }, { -        "BriefDescription": "Percentage of all uops which are FPDiv uops", -        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@UOPS_= RETIRED.ALL@", -        "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio", +        "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", +        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", +        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", +        "Unit": "cpu_atom" }, { -        "BriefDescription": "Percentage of all uops which are IDiv uops", -        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@UOPS_R= ETIRED.ALL@", -        "MetricName": "tma_info_uop_mix_idiv_uop_ratio", +        "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", +        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", +        "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", +        "Unit": "cpu_atom" }, { -        "BriefDescription": "Percentage of all uops which are microcode op= s", -        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@UOPS_RET= IRED.ALL@", -        "MetricName": "tma_info_uop_mix_microcode_uop_ratio", + 
"BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", + "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired return Branch Mispre= diction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Branch Misprediction= ", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipmispredict", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are taken condition= als", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk + tma_info_branches_callret + tma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of 
time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", + "MetricGroup": "Count;SMT", + "MetricName": "tma_info_core_core_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", + "MetricGroup": "Core_Metric;Ret;SMT;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_core_coreipc", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", + "MetricGroup": "Core_Metric;Flops;Ret", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", + "MetricGroup": "Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREA= D\\,cmask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@UOPS_RETIRED.ALL@ / cpu_atom@INST_RETIRED.= ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MIT= E_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\= \=3D0x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / cpu_core@ICACHE_DATA.STALLS\\,= cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + 
"BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS.IFETCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS.IFETCH_LLC_HIT@ + = cpu_atom@MEM_BOUND_STALLS.IFETCH_DRAM_HIT@) / cpu_atom@MEM_BOUND_STALLS.IFE= TCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS.IFETCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.IFETCH_DRAM_HIT@ / = cpu_atom@MEM_BOUND_STALLS.IFETCH@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. 
Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;Inst_Metric;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", + "MetricGroup": "Inst_Metric;Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_= issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "PublicDescription": "Instructions per taken branch. 
Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L2",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_L2_HIT@ / cpu_atom@MEM_BOUND_STALLS.LOAD@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses in the L2",
+        "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ + cpu_atom@MEM_BOUND_STALLS.LOAD_DRAM_HIT@) / cpu_atom@MEM_BOUND_STALLS.LOAD@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that hit the L3",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_LLC_HIT@ / cpu_atom@MEM_BOUND_STALLS.LOAD@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of memory bound stalls where retirement is stalled due to an L1 miss that subsequently misses the L3",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS.LOAD_DRAM_HIT@ / cpu_atom@MEM_BOUND_STALLS.LOAD@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement due to a pipeline block",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_l1_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the oldest load of the load buffer is stalled at retirement",
+        "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom@MEM_BOUND_STALLS.LOAD@) / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_load_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to store buffer full",
+        "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_atom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler",
+        "MetricGroup": "load_store_bound",
+        "MetricName": "tma_info_load_store_bound_store_bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory disambiguation",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to floating point assists",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assist_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory ordering",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to memory renaming",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to page faults",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fault_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to self-modifying code",
+        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_RETIRED.ANY@",
+        "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.4K_ALIAS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasing",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads with a store forward or unknown store address block",
+        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a first level data cache miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to other block cases, such as pipeline conflicts, fences, etc",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipelineblks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a pagewalk",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a second level TLB miss",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of Memory Execution Bound due to a store forward address match",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@LD_HEAD.ANY_AT_RET@",
+        "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Load",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_ipload",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Store",
+        "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETIRED.ALL_STORES@",
+        "MetricName": "tma_info_mem_mix_ipstore",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads that perform one or more locks",
+        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_load_locks_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of total non-speculative loads that are splits",
+        "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
+        "MetricName": "tma_info_mem_mix_load_splits_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Ratio of mem load uops to all uops",
+        "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_atom@UOPS_RETIRED.ALL@",
+        "MetricName": "tma_info_mem_mix_memload_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "tma_info_memory_l1d_cache_fill_bw",
+        "MetricGroup": "Core_Metric;Mem;MemoryBW",
+        "MetricName": "tma_info_memory_core_l1d_cache_fill_bw_2t",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "tma_info_memory_l2_cache_fill_bw",
+        "MetricGroup": "Core_Metric;Mem;MemoryBW",
+        "MetricName": "tma_info_memory_core_l2_cache_fill_bw_2t",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-core data access bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "tma_info_memory_l3_cache_access_bw",
+        "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore",
+        "MetricName": "tma_info_memory_core_l3_cache_access_bw_2t",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-core data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "tma_info_memory_l3_cache_fill_bw",
+        "MetricGroup": "Core_Metric;Mem;MemoryBW",
+        "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_fb_hpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric",
+        "MetricName": "tma_info_memory_l1d_cache_fill_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l1mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L1 cache true misses per kilo instruction for all demand loads (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l1mpki_load",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric",
+        "MetricName": "tma_info_memory_l2_cache_fill_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache hits per kilo instruction for all request types (including speculative)",
+        "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2hpki_all",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache hits per kilo instruction for all demand loads (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2hpki_load",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache true misses per kilo instruction for retired demand loads",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "Backend;CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all request types (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric;Offcore",
+        "MetricName": "tma_info_memory_l2mpki_all",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instruction for all demand loads (including speculative)",
+        "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheHits;Mem;Metric",
+        "MetricName": "tma_info_memory_l2mpki_load",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Offcore requests (L2 cache miss) per kilo instruction for demand RFOs",
+        "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "CacheMisses;Metric;Offcore",
+        "MetricName": "tma_info_memory_l2mpki_rfo",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric;Offcore",
+        "MetricName": "tma_info_memory_l3_cache_access_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
+        "MetricGroup": "Mem;MemoryBW;Metric",
+        "MetricName": "tma_info_memory_l3_cache_fill_bw",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "L3 cache true misses per kilo instruction for retired demand loads",
+        "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_l3mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Parallel L2 cache miss data reads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD",
+        "MetricGroup": "Memory_BW;Metric;Offcore",
+        "MetricName": "tma_info_memory_latency_data_l2_mlp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Latency for L2 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
+        "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l2_miss_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Parallel L2 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=0x1@",
+        "MetricGroup": "Memory_BW;Metric;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l2_mlp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Latency for L3 cache miss demand Loads",
+        "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD",
+        "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore",
+        "MetricName": "tma_info_memory_latency_load_l3_miss_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Actual Average Latency for L1 data-cache miss demand load operations (in core cycles)",
+        "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_ANY",
+        "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat",
+        "MetricName": "tma_info_memory_load_miss_real_latency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "\"Bus lock\" per kilo instruction",
+        "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_mix_bus_lock_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Un-cacheable retired load per kilo instruction",
+        "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;Metric",
+        "MetricName": "tma_info_memory_mix_uc_load_pki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss",
+        "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLES",
+        "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric",
+        "MetricName": "tma_info_memory_mlp",
+        "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Metric;Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Fed;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_code_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_load_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses",
+        "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)",
+        "MetricGroup": "Core_Metric;Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_page_walks_utilization",
+        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_store_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
+        "MetricGroup": "Cor;Metric;Pipeline;PortsUtil;SMT",
+        "MetricName": "tma_info_pipeline_execute",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from DSB per cycle",
+        "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_dsb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from LSD per cycle",
+        "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_lsd",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from MITE per cycle",
+        "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_mite",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per a microcode Assist invocation",
+        "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
+        "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_pipeline_ipassist",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
+        "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_retire",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions",
+        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;MicroSeq;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_strings_cycles",
+        "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction",
+        "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricName": "tma_info_serialization _%_tpause_cycles",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles the processor is waiting yet unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 power-performance optimized states",
+        "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Metric",
+        "MetricName": "tma_info_system_c0_wait",
+        "MetricThreshold": "tma_info_system_c0_wait > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
+        "MetricGroup": "Power;Summary;System_Metric",
+        "MetricName": "tma_info_system_core_frequency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average CPU Utilization (percentage)",
+        "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online",
+        "MetricGroup": "HPC;Metric;Summary",
+        "MetricName": "tma_info_system_cpu_utilization",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of utilized CPUs",
+        "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC",
+        "MetricGroup": "Metric;Summary",
+        "MetricName": "tma_info_system_cpus_utilized",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
+        "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_REQUESTS.ALL) / 1e6 / tma_info_system_time / 1e3",
+        "MetricGroup": "GB/sec;HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
+        "MetricName": "tma_info_system_dram_bw_use",
+        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Giga Floating Point Operations Per Second",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
+        "MetricGroup": "Cor;Flops;HPC;Metric",
+        "MetricName": "tma_info_system_gflops",
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
+        "MetricGroup": "Branches;Inst_Metric;OS",
+        "MetricName": "tma_info_system_ipfarbranch",
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
+        "MetricGroup": "Metric;OS",
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_kernel_utilization",
+        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Clocks;Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC;System_Metric",
+        "MetricName": "tma_info_system_power",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
+        "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)",
+        "MetricGroup": "Core_Metric;SMT",
+        "MetricName": "tma_info_system_smt_2t_utilization",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Socket actual clocks when any core is active on that socket",
+        "MetricExpr": "UNC_CLOCK.SOCKET",
+        "MetricGroup": "Count;SoC",
+        "MetricName": "tma_info_system_socket_clks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Seconds;Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Frequency Utilization relative nominal frequency",
+        "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
+        "MetricGroup": "Power",
+        "MetricName": "tma_info_system_turbo_utilization",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Count;Pipeline",
+        "MetricName": "tma_info_thread_clks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
+        "MetricExpr": "1 / tma_info_thread_ipc",
+        "MetricGroup": "Mem;Metric;Pipeline",
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "The ratio of Executed- by Issued-Uops",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
+        "MetricGroup": "Cor;Metric;Pipeline",
+        "MetricName": "tma_info_thread_execute_per_issue",
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
+        "MetricGroup": "Metric;Ret;Summary",
+        "MetricName": "tma_info_thread_ipc",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
+        "MetricExpr": "slots",
+        "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots",
+        "MetricgroupNoGroup": "TopdownL1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor",
+        "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on else 1)",
+        "MetricGroup": "Metric;SMT;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots_utilization",
+        "MetricgroupNoGroup": "TopdownL1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops Per Instruction",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
+        "MetricGroup": "Metric;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_thread_uoppi",
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops per taken branch",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW;Metric",
+        "MetricName": "tma_info_thread_uptb",
+        "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are FPDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@UOPS_RETIRED.ALL@",
+        "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are IDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@UOPS_RETIRED.ALL@",
+        "MetricName": "tma_info_uop_mix_idiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are microcode ops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@UOPS_RETIRED.ALL@",
+        "MetricName": "tma_info_uop_mix_microcode_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are x87 uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@UOPS_RETIRED.ALL@",
+        "MetricName": "tma_info_uop_mix_x87_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
+        "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l2_bound",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "3 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l3_bound",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
+        "MetricExpr": "(12 * tma_info_system_core_frequency - 3 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group;tma_overlap",
+        "MetricName": "tma_l3_hit_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)",
+        "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
+        "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_lcp",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_core_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_hit",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_miss",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
+        "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks",
+        "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricName": "tma_lock_latency",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit",
+        "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_lsd",
+        "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+        "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
+        "MetricName": "tma_machine_clears",
+        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricName": "tma_mem_bandwidth",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
+        "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
+        "MetricName": "tma_mem_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_mem_scheduler",
+        "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_memory_bound",
+        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_memory_fence",
+        "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
+        "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_memory_operations",
+        "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit",
+        "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots",
+        "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
+        "MetricName": "tma_microcode_sequencer",
+        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are x87 uops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@UOPS_RETIRED.ALL@",
-        "MetricName": "tma_info_uop_mix_x87_uop_ratio",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage",
+        "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
+        "MetricName": "tma_mispredicts_resteers",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
-        "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)",
+        "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_mite",
+        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
-        "MetricName": "tma_machine_clears",
-        "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)",
+        "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
+        "MetricName": "tma_mixing_vectors",
+        "MetricThreshold": "tma_mixing_vectors > 0.05",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_mem_scheduler",
-        "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks / 2",
+        "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)",
+        "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1\\,edge\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
+        "MetricName": "tma_ms_switches",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused",
+        "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_non_fused_branches",
+        "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+@@ -628,82 +2242,389 @@
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_non_mem_scheduler",
-        "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_other_light_ops_group",
+        "MetricName": "tma_nop_instructions",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_nuke",
+        "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05) & ((tma_bad_speculation >0.15)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (5 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_other_fb",
+        "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes",
+        "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches))",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_other_light_ops",
+        "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks",
+        "MetricGroup": "Compute;Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+        "MetricName": "tma_port_0",
+        "MetricThreshold": "tma_port_0 > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)",
+        "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks",
+        "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+        "MetricName": "tma_port_1",
+        "MetricThreshold": "tma_port_1 > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1.
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", + "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_util= ization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", + "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_b= ound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_= utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL= 1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2= P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (5 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_predecode", + "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (5 * cpu_atom= @CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_register", + "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (5 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_reorder_buffer", + "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL@ / (5 * cpu_atom@CPU_= CLK_UNHALTED.CORE@) - tma_core_bound", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_resource_bound", + "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. 
Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (5 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", + "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_c= ore_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_othe= r_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", + "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bo= und_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", + "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueS= pSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", + "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bou= nd_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). 
However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;T= opdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_= group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_= 8) / (4 * tma_info_core_core_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. 
Sample with:= UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (5 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", - "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", + "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;= tma_dtlb_store_group", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (5 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", + "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group= ;tma_dtlb_store_group", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (5 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (5 * cpu_atom= 
@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (5 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_store_stlb_miss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", - "MetricExpr": "tma_backend_bound - tma_core_bound", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", - "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", + "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4= _group;tma_issueSmSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. 
Related metrics: tma= _fb_full", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that result = in retirement slots", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / (5 * cpu_atom@CPU_= CLK_UNHALTED.CORE@)", - "MetricGroup": "Default;TopdownL1;tma_L1_group", - "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.75", - "MetricgroupNoGroup": "TopdownL1;Default", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", + "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4= _group;tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (5 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", - "MetricName": "tma_serialization", - "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", + "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", + "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_g= roup", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -715,8 +2636,8 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", - "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_D= ISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DIS= PATCHED.PORT_6@) / (5 * tma_info_core_core_clks)", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", "MetricThreshold": "tma_alu_op_utilization > 0.4", @@ -725,17 +2646,17 @@ }, { "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", - "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. 
Sample with: ASSISTS.ANY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists.", - "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_threa= d_slots", + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", "MetricThreshold": "tma_avx_assists > 0.1", @@ -745,7 +2666,7 @@ { "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", @@ -762,46 +2683,151 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%", "Unit": "cpu_core" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (t= ma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_res= teers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispr= edicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_bra= nches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_= ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / = (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@U= OPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + = tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma= _other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + = tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itl= b_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switch= es) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms= )) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / t= ma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RES= OURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_s= erializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound += tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dt= lb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_= streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "Unit": "cpu_core" + }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", - "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@top= down\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-re= tiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredict= ions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. 
Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", - "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_= thread_clks + tma_unknown_branches", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_cl= ks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_cl= ks", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": 
"tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -810,28 +2836,82 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", - "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 
0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "(25 * tma_info_system_core_frequency * (cpu_core@ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP= _HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) + 24 * tma_info_system_core_frequ= ency * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@) * (1 + cpu_core@MEM_LOA= D_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thre= ad_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((28 * tma_info_system_core_frequency - 3 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_= DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (27 * tma_info_system_core_frequ= ency - 3 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -842,107 +2922,107 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", - "MetricExpr": "24 * tma_info_system_core_frequency * (cpu_core@MEM= _LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_F= WD@ * (1 - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP= _HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_L= OAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks", + "MetricExpr": "(27 * tma_info_system_core_frequency - 3 * tma_info= _system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L= 3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cp= u_core@INST_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - = cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks = / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", - "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", - "MetricExpr": "cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@ / tma_info= _thread_clks", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. 
Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", - "MetricExpr": "(cpu_core@IDQ.DSB_CYCLES_ANY@ - cpu_core@IDQ.DSB_CY= CLES_OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info= _core_core_clks / 2", "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_in= fo_thread_clks", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\= \=3D1@ + cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@, max(cpu_core@CYCLE_ACTIVIT= Y.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVITY.CYCLES_L1D_MISS@, 0)) / tma_in= fo_thread_clks", + "MetricExpr": "min(7 * cpu_core@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\= \=3D0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY = - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D1@ + cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu_core@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\= =3D0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. 
As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.= DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", - "MetricExpr": "cpu_core@L1D_PEND_MISS.FB_FULL@ / tma_info_thread_c= lks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -953,28 +3033,28 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / t= ma_info_thread_slots", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -984,138 +3064,147 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). 
Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Floating Point (FP) Assists", - "MetricExpr": "30 * cpu_core@ASSISTS.FP@ / tma_info_thread_slots", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retir= ing * tma_info_thread_slots)", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", - "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retir= ing * tma_info_thread_slots)", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tm= a_info_thread_slots", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", - "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .). Sample with: UOPS= _RETIRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@ / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_NTAKEN@", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN@", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.INDIRECT@", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.RET@", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + 
"MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", @@ -1123,15 +3212,15 @@ }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;BadSpec;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmispredict", "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", - "MetricExpr": "cpu_core@INT_MISC.CLEARS_COUNT@ / (cpu_core@BR_MISP= _RETIRED.ALL_BRANCHES@ + cpu_core@MACHINE_CLEARS.COUNT@)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio", "Unit": "cpu_core" @@ -1146,8 +3235,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite)))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", @@ -1155,7 +3244,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -1164,142 +3253,36 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ + 2 = * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRED.NOP@) / tma_i= nfo_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_h= it_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_dtlb_load + tma_fb_full + tma_l1_hit_= latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_b= ig_code", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY\\,umask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
-        "Unit": "cpu_core"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))",
-        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_data_tlbs",
-        "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20",
-        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_synchronization",
-        "Unit": "cpu_core"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
-        "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_synchronization",
-        "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 10",
-        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs",
-        "Unit": "cpu_core"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
-        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
-        "MetricName": "tma_info_bottleneck_mispredictions",
-        "MetricThreshold": "tma_info_bottleneck_mispredictions > 20",
-        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
-        "Unit": "cpu_core"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
-        "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bottleneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info_bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_latency + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_synchronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_irregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottleneck_useful_work)",
-        "MetricGroup": "BvOB;Cor;Offcore",
-        "MetricName": "tma_info_bottleneck_other_bottlenecks",
-        "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20",
-        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls.",
-        "Unit": "cpu_core"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead.",
-        "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRED.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "BvUW;Ret",
-        "MetricName": "tma_info_bottleneck_useful_work",
-        "MetricThreshold": "tma_info_bottleneck_useful_work > 20",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Fraction of branches that are CALL or RET",
-        "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@BR_INST_RETIRED.NEAR_RETURN@) / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;Branches",
         "MetricName": "tma_info_branches_callret",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Fraction of branches that are non-taken conditionals",
-        "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_NTAKEN@ / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;Branches;CodeGen;PGO",
         "MetricName": "tma_info_branches_cond_nt",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Fraction of branches that are taken conditionals",
-        "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_TAKEN@ / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;Branches;CodeGen;PGO",
         "MetricName": "tma_info_branches_cond_tk",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps",
-        "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_TAKEN@ - cpu_core@BR_INST_RETIRED.COND_TAKEN@ - 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@) / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;Branches",
         "MetricName": "tma_info_branches_jump",
         "Unit": "cpu_core"
@@ -1313,50 +3296,50 @@
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
"MetricExpr": "(cpu_core@CPU_CLK_UNHALTED.DISTRIBUTED@ if #SMT_on = else tma_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks", "Unit": "cpu_core" }, { "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_core_core_clk= s", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_core_coreipc", "Unit": "cpu_core" }, { "BriefDescription": "uops Executed per Cycle", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / tma_info_thread_cl= ks", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", "MetricGroup": "Power", "MetricName": "tma_info_core_epc", "Unit": "cpu_core" }, { "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / tma_info_core_core_clks", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", "MetricGroup": "Flops;Ret", "MetricName": "tma_info_core_flopc", "Unit": "cpu_core" }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP= _ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tm= a_info_core_core_clks)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache)",
-        "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@UOPS_ISSUED.ANY@",
+        "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY",
         "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_frontend_dsb_coverage",
         "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_info_thread_ipc / 6 > 0.35",
@@ -1364,29 +3347,29 @@
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.",
-        "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@",
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "DSBmiss",
         "MetricName": "tma_info_frontend_dsb_switch_cost",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Average number of Uops issued by front-end when it issued something",
-        "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.ANY\\,cmask\\=1@",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_frontend_fetch_upc",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Average Latency for L1 instruction cache misses",
-        "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / cpu_core@ICACHE_DATA.STALLS\\,cmask\\=1\\,edge@",
+        "MetricExpr": "ICACHE_DATA.STALLS / cpu_core@ICACHE_DATA.STALLS\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "Fed;FetchLat;IcMiss",
         "MetricName": "tma_info_frontend_icache_miss_latency",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions per non-speculative DSB miss (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FRONTEND_RETIRED.ANY_DSB_MISS@",
+        "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS",
         "MetricGroup": "DSBmiss;Fed",
         "MetricName": "tma_info_frontend_ipdsb_miss_ret",
         "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50",
@@ -1394,50 +3377,57 @@
     },
     {
         "BriefDescription": "Instructions per speculative Unknown Branch Misprediction (BAClear) (lower number means higher occurrence rate)",
-        "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@BACLEARS.ANY@",
+        "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY",
         "MetricGroup": "Fed",
         "MetricName": "tma_info_frontend_ipunknown_branch",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "L2 cache true code cacheline misses per kilo instruction",
-        "MetricExpr": "1e3 * cpu_core@FRONTEND_RETIRED.L2_MISS@ / cpu_core@INST_RETIRED.ANY@",
+        "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY",
INST_RETIRED.ANY", "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.CODE_RD_MISS@ / cpu_core@IN= ST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code_all", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", - "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", "MetricGroup": "Fed;LSD", "MetricName": "tma_info_frontend_lsd_coverage", "Unit": "cpu_core" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch.", - "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", "Unit": "cpu_core" }, { "BriefDescription": "Total number of retired Instructions", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "INST_RETIRED.ANY", "MetricGroup": "Summary;TmaL1;tma_L1_group", "MetricName": "tma_info_inst_mix_instructions", "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", @@ -1445,52 +3435,52 @@ }, { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + cpu_core@FP_ARITH_INST_RETIRED.VECTOR@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW.", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting.",
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting.",
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST_RETIRED.SCALAR_DOUBLE@",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting.",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST_RETIRED.SCALAR_SINGLE@",
+        "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting.",
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Branches;Fed;InsType",
         "MetricName": "tma_info_inst_mix_ipbranch",
         "MetricThreshold": "tma_info_inst_mix_ipbranch < 8",
@@ -1498,7 +3488,7 @@
     },
     {
         "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIRED.NEAR_CALL@",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_ipcall",
         "MetricThreshold": "tma_info_inst_mix_ipcall < 200",
@@ -1506,7 +3496,7 @@
     },
     {
         "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_ipflop",
         "MetricThreshold": "tma_info_inst_mix_ipflop < 10",
@@ -1514,7 +3504,7 @@
     },
     {
         "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETIRED.ALL_LOADS@",
+        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipload",
         "MetricThreshold": "tma_info_inst_mix_ipload < 3",
@@ -1522,14 +3512,14 @@
     },
     {
         "BriefDescription": "Instructions per PAUSE (lower number means higher occurrence rate)",
-        "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@CPU_CLK_UNHALTED.PAUSE_INST@",
+        "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.PAUSE_INST",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_ippause",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@",
+        "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipstore",
         "MetricThreshold": "tma_info_inst_mix_ipstore < 8",
@@ -1537,7 +3527,7 @@
     },
     {
         "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@",
+        "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY",
         "MetricGroup": "Prefetches",
         "MetricName": "tma_info_inst_mix_ipswpf",
         "MetricThreshold": "tma_info_inst_mix_ipswpf < 100",
@@ -1545,10 +3535,10 @@
     },
     {
         "BriefDescription": "Instructions per taken branch",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIRED.NEAR_TAKEN@",
"cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 13", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1582,176 +3572,184 @@ }, { "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@= INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_fb_hpki", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L1D.REPLACEMENT@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.ALL_DEMAND_DATA_RD@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L2_LINES_IN.ALL@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", - "MetricExpr": "1e3 * (cpu_core@L2_RQSTS.REFERENCES@ - cpu_core@L2_= RQSTS.MISS@) / cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_HIT@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_load", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L2_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", "MetricGroup": "Backend;CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki", 
"Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.MISS@ / cpu_core@INST_RETIR= ED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem;Offcore", "MetricName": "tma_info_memory_l2mpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_MISS@ / cpu_= core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.RFO_MISS@ / cpu_core@INST_R= ETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheMisses;Offcore", "MetricName": "tma_info_memory_l2mpki_rfo", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@OFFCORE_REQUESTS.ALL_REQUESTS@ / 1e9 = / duration_time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@LONGEST_LAT_CACHE.MISS@ / 1e9 / durat= ion_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L3_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_l3mpki", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss data reads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD@ / cp= u_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_data_l2_mlp", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS.DEMAND_DATA_RD@", - "MetricGroup": "Memory_Lat;Offcore", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_c= ore@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" }, { 
"BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAN= D_DATA_RD@ / cpu_core@OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", "MetricGroup": "Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l3_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@MEM_LOAD= _COMPLETED.L1_MISS_ANY@", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", "MetricGroup": "Mem;MemoryBound;MemoryLat", "MetricName": "tma_info_memory_load_miss_real_latency", "Unit": "cpu_core" }, { "BriefDescription": "\"Bus lock\" per kilo instruction", - "MetricExpr": "1e3 * cpu_core@SQ_MISC.BUS_LOCK@ / cpu_core@INST_RE= TIRED.ANY@", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_bus_lock_pki", "Unit": "cpu_core" }, { "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_MISC_RETIRED.UC@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_uc_load_pki", "Unit": "cpu_core" }, { "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@L1D_PEND= _MISS.PENDING_CYCLES@", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", "MetricGroup": "Mem;MemoryBW;MemoryBound", "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. 
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
-        "MetricExpr": "1e3 * cpu_core@ITLB_MISSES.WALK_COMPLETED@ / cpu_core@INST_RETIRED.ANY@",
+        "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricGroup": "Fed;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_code_stlb_mpki",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
-        "MetricExpr": "1e3 * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED@ / cpu_core@INST_RETIRED.ANY@",
+        "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricGroup": "Mem;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_load_stlb_mpki",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses",
-        "MetricExpr": "(cpu_core@ITLB_MISSES.WALK_PENDING@ + cpu_core@DTLB_LOAD_MISSES.WALK_PENDING@ + cpu_core@DTLB_STORE_MISSES.WALK_PENDING@) / (4 * tma_info_core_core_clks)",
+        "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)",
         "MetricGroup": "Mem;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_page_walks_utilization",
         "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5",
@@ -1759,58 +3757,58 @@
     },
     {
         "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
-        "MetricExpr": "1e3 * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED@ / cpu_core@INST_RETIRED.ANY@",
+        "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricGroup": "Mem;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_store_stlb_mpki",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per core",
-        "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXECUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=1@)",
+        "BriefDescription": "",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
         "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
         "MetricName": "tma_info_pipeline_execute",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Average number of uops fetched from DSB per cycle",
-        "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@IDQ.DSB_CYCLES_ANY@",
+        "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_pipeline_fetch_dsb",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Average number of uops fetched from LSD per cycle",
-        "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@LSD.CYCLES_ACTIVE@",
+        "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_pipeline_fetch_lsd",
         "Unit": "cpu_core"
     },
     {
"BriefDescription": "Average number of uops fetched from MITE per = cycle", - "MetricExpr": "cpu_core@IDQ.MITE_UOPS@ / cpu_core@IDQ.MITE_CYCLES_= ANY@", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_mite", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per a microcode Assist invocatio= n", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.= SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -1818,7 +3816,7 @@ }, { "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C0_WAIT@ / tma_info_threa= d_clks", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", "MetricGroup": "C0Wait", "MetricName": "tma_info_system_c0_wait", "MetricThreshold": "tma_info_system_c0_wait > 0.05", @@ -1826,7 +3824,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency", "Unit": "cpu_core" @@ -1840,22 +3838,22 @@ }, { "BriefDescription": "Average number of utilized CPUs", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.REF_TSC@ / TSC", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", "MetricGroup": "Summary", "MetricName": "tma_info_system_cpus_utilized", "Unit": "cpu_core" }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
+        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
-        "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / 1e9 / duration_time",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width",
@@ -1863,22 +3861,23 @@
     },
     {
         "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIRED.FAR_BRANCH@u",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6",
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INST_RETIRED.ANY_P@k",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
         "MetricGroup": "OS",
         "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU_CLK_UNHALTED.THREAD@",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "OS",
         "MetricName": "tma_info_system_kernel_utilization",
         "MetricThreshold": "tma_info_system_kernel_utilization > 0.05",
@@ -1901,9 +3900,24 @@
         "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
-        "MetricExpr": "(1 - cpu_core@CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE@ / cpu_core@CPU_CLK_UNHALTED.REF_DISTRIBUTED@ if #SMT_on else 0)",
+        "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_DISTRIBUTED if #SMT_on else 0)",
         "MetricGroup": "SMT",
         "MetricName": "tma_info_system_smt_2t_utilization",
         "Unit": "cpu_core"
@@ -1915,16 +3929,24 @@
         "MetricName": "tma_info_system_socket_clks",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
-        "MetricExpr": "tma_info_thread_clks / cpu_core@CPU_CLK_UNHALTED.REF_TSC@",
+        "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_turbo_utilization",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks",
         "Unit": "cpu_core"
@@ -1934,40 +3956,41 @@
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
         "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "The ratio of Executed- by Issued-Uops",
-        "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSUED.ANY@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
         "MetricGroup": "Cor;Pipeline",
         "MetricName": "tma_info_thread_execute_per_issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage.",
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
-        "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_thread_clks",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
         "MetricGroup": "Ret;Summary",
         "MetricName": "tma_info_thread_ipc",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
-        "MetricExpr": "cpu_core@TOPDOWN.SLOTS@",
+        "MetricExpr": "slots",
        "MetricGroup": "TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor",
-        "MetricExpr": "(tma_info_thread_slots / (cpu_core@TOPDOWN.SLOTS@ / 2) if #SMT_on else 1)",
+        "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on else 1)",
         "MetricGroup": "SMT;TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots_utilization",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "Uops Per Instruction",
-        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@INST_RETIRED.ANY@",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
         "MetricGroup": "Pipeline;Ret;Retire",
         "MetricName": "tma_info_thread_uoppi",
         "MetricThreshold": "tma_info_thread_uoppi > 1.05",
@@ -1975,10 +3998,19 @@
     },
     {
         "BriefDescription": "Uops per taken branch",
-        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_INST_RETIRED.NEAR_TAKEN@",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 9",
+        "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
@@ -1987,114 +4019,124 @@
         "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_int_operations",
         "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain.",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
-        "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_128@ + cpu_core@INT_VEC_RETIRED.VNNI_128@) / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
         "MetricName": "tma_int_vector_128b",
-        "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
-        "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_256@ + cpu_core@INT_VEC_RETIRED.MUL_256@ + cpu_core@INT_VEC_RETIRED.VNNI_256@) / (tma_retiring * tma_info_thread_slots)",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
         "MetricName": "tma_int_vector_256b",
-        "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operations > 0.1 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
-        "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
-        "MetricExpr": "max((cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ - cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@) / tma_info_thread_clks, 0)",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
-        "MetricExpr": "min(2 * (cpu_core@MEM_INST_RETIRED.ALL_LOADS@ - cpu_core@MEM_LOAD_RETIRED.FB_HIT@ - cpu_core@MEM_LOAD_RETIRED.L1_MISS@) * 20 / 100, max(cpu_core@CYCLE_ACTIVITY.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVITY.CYCLES_L1D_MISS@, 0)) / tma_info_thread_clks",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
-        "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@ - cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@) / tma_info_thread_clks",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3 * tma_info_system_core_frequency * MEM_LOAD_RETIR= ED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / = tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_cor= e@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "9 * tma_info_system_core_frequency * (cpu_core@MEM_= LOAD_RETIRED.L3_HIT@ * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@ME= M_LOAD_RETIRED.L1_MISS@ / 2)) / tma_info_thread_clks", + "MetricExpr": "(12 * tma_info_system_core_frequency - 3 * tma_info= _system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. 
([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_2_3_10@ / (3 * tma_in= fo_core_core_clks)", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_load_op_utilization", "MetricThreshold": "tma_load_op_utilization > 0.6", @@ -2107,36 +4149,63 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", - "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": 
"tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "(16 * max(0, cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ = - cpu_core@L2_RQSTS.ALL_RFO@) + cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu= _core@MEM_INST_RETIRED.ALL_STORES@ * (10 * cpu_core@L2_RQSTS.RFO_HIT@ + min= (cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.C= YCLES_WITH_DEMAND_RFO@))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "(cpu_core@LSD.CYCLES_ACTIVE@ - cpu_core@LSD.CYCLES_= OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core= _core_clks / 2", "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. 
However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2147,54 +4216,54 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clk= s", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", - "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", @@ -2203,27 +4272,27 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "cpu_core@UOPS_RETIRED.MS@ / tma_info_thread_slots", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. 
Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", - "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", - "MetricExpr": "(cpu_core@IDQ.MITE_CYCLES_ANY@ - cpu_core@IDQ.MITE_= CYCLES_OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", @@ -2232,41 +4301,50 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", - "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. 
Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu_core@UOPS_RETIRED.MS\\,c= mask\\=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_cor= e_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ = / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tma_info_thr= ead_clks", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge= \\=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tm= a_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializ= ing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. 
Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (cpu_core@BR_INST_RETIRED.AL= L_BRANCHES@ - cpu_core@INST_RETIRED.MACRO_FUSED@) / (tma_retiring * tma_inf= o_thread_slots)", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2282,89 +4360,89 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", - "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", - "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", - "MetricExpr": "99 * cpu_core@ASSISTS.PAGE_FAULT@ / tma_info_thread= _slots", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. 
Note operating syste= m handling of page faults accounts for the majority of its cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_0@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks", "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_1@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_6@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (cp= u_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_= PORTS_UTIL\\,umask\\=3D0xc@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_= ACTIVE@ < cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOU= ND_ON_LOADS@ else (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu= _core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info_thread_clks)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ + max(cpu= _core@RS.EMPTY\\,umask\\=3D1@ - cpu_core@RESOURCE_STALLS.SCOREBOARD@, 0)) /= tma_info_thread_clks * (cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@E= XE_ACTIVITY.BOUND_ON_LOADS@) / tma_info_thread_clks", + "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2372,21 +4450,21 @@ { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2394,7 +4472,7 @@ { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. 
issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2405,98 +4483,98 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "cpu_core@RESOURCE_STALLS.SCOREBOARD@ / tma_info_thr= ead_clks + tma_c02_wait", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", - "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_clks",
+        "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_slow_pause",
-        "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu_core@LD_BLOCKS.NO_SR@ / tma_info_thread_clks",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_split_loads",
-        "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents rate of split store accesses",
-        "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ / tma_info_core_core_clks",
+        "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
         "MetricName": "tma_split_stores",
-        "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
-        "MetricExpr": "(cpu_core@XQ.FULL_CYCLES@ + cpu_core@L1D_PEND_MISS.L2_STALLS@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
         "MetricName": "tma_sq_full",
-        "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write",
-        "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_thread_clks",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_store_bound",
-        "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
-        "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_thread_clks",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_store_fwd_blk",
-        "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
-        "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@) + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks",
-        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
         "MetricName": "tma_store_latency",
-        "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
-        "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_4_9@ + cpu_core@UOPS_DISPATCHED.PORT_7_8@) / (4 * tma_info_core_core_clks)",
+        "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_store_op_utilization",
         "MetricThreshold": "tma_store_op_utilization > 0.6",
@@ -2509,46 +4587,73 @@
         "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_hit",
-        "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
-        "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_core_core_clks",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_miss",
-        "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
-        "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_info_thread_clks",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
         "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
         "MetricName": "tma_streaming_stores",
-        "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
-        "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info_thread_clks",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
         "MetricName": "tma_unknown_branches",
-        "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
-        "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_core@UOPS_EXECUTED.THREAD@",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
         "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
         "MetricName": "tma_x87_use",
-        "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     }
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/cache.json b/tools/perf/pmu-events/arch/x86/alderlake/cache.json
index 3f51686fe7a8..a20e19738046 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/cache.json
@@ -91,6 +91,26 @@
         "UMask": "0x1f",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Modified cache lines that are evicted by L2 cache when triggered by an L2 cache fill.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.NON_SILENT",
+        "PublicDescription": "Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines are in Modified state. Modified lines are written back to L3",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.SILENT",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "Cache lines that have been L2 hardware prefetched but not used by demand accesses",
         "Counter": "0,1,2,3",
@@ -101,6 +121,15 @@
         "UMask": "0x4",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the total number of L2 Cache accesses. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.ALL",
+        "PublicDescription": "Counts the total number of L2 Cache Accesses, includes hits, misses, rejects front door requests for CRd/DRd/RFO/ItoM/L2 Prefetches only. Counts on a per core basis.",
+        "SampleAfterValue": "200003",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "All accesses to L2 cache [This event is alias to L2_RQSTS.REFERENCES]",
         "Counter": "0,1,2,3",
@@ -111,6 +140,26 @@
         "UMask": "0xff",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of L2 Cache accesses that resulted in a hit. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.HIT",
+        "PublicDescription": "Counts the number of L2 Cache accesses that resulted in a hit from a front door request only (does not include rejects or recycles), Counts on a per core basis.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of L2 Cache accesses that resulted in a miss. Counts on a per core basis.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.MISS",
+        "PublicDescription": "Counts the number of L2 Cache accesses that resulted in a miss from a front door request only (does not include rejects or recycles). Counts on a per core basis.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Read requests with true-miss in L2 cache. [This event is alias to L2_RQSTS.MISS]",
         "Counter": "0,1,2,3",
@@ -412,7 +461,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.",
         "SampleAfterValue": "1000003",
         "UMask": "0x81",
@@ -424,7 +472,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired store instructions.",
         "SampleAfterValue": "1000003",
         "UMask": "0x82",
@@ -436,7 +483,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired memory instructions - loads and stores.",
         "SampleAfterValue": "1000003",
         "UMask": "0x83",
@@ -448,7 +494,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with locked access.",
         "SampleAfterValue": "100007",
         "UMask": "0x21",
@@ -460,7 +505,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x41",
@@ -472,7 +516,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x42",
@@ -484,7 +527,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x11",
@@ -496,7 +538,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES",
-        "PEBS": "1",
         "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x12",
@@ -518,7 +559,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.",
         "SampleAfterValue": "20011",
         "UMask": "0x4",
@@ -530,7 +570,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x2",
@@ -542,7 +581,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.",
         "SampleAfterValue": "20011",
         "UMask": "0x4",
@@ -554,7 +592,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x1",
@@ -566,7 +603,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.",
         "SampleAfterValue": "100003",
         "UMask": "0x8",
@@ -578,7 +614,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x2",
@@ -590,7 +625,6 @@
         "Data_LA": "1",
         "EventCode": "0xd3",
         "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM",
-        "PEBS": "1",
         "PublicDescription": "Retired load instructions which data sources missed L3 but serviced from local DRAM.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -602,7 +636,6 @@
         "Data_LA": "1",
         "EventCode": "0xd4",
         "EventName": "MEM_LOAD_MISC_RETIRED.UC",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access (Bus Lock).",
         "SampleAfterValue": "100007",
         "UMask": "0x4",
@@ -614,7 +647,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.FB_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.",
         "SampleAfterValue": "100007",
         "UMask": "0x40",
@@ -626,7 +658,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.",
         "SampleAfterValue": "1000003",
         "UMask": "0x1",
@@ -638,7 +669,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.",
         "SampleAfterValue": "200003",
         "UMask": "0x8",
@@ -650,7 +680,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.",
         "SampleAfterValue": "200003",
         "UMask": "0x2",
@@ -662,7 +691,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.",
         "SampleAfterValue": "100021",
         "UMask": "0x10",
@@ -674,7 +702,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.",
         "SampleAfterValue": "100021",
         "UMask": "0x4",
@@ -686,7 +713,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.",
         "SampleAfterValue": "50021",
         "UMask": "0x20",
@@ -698,33 +724,90 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.DRAM_HIT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x80",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the number of load uops retired that hit in the L3 cache, in which a snoop was required and modified data was forwarded from another core or module.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.HITM",
+        "SampleAfterValue": "200003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load uops retired that hit in the L1 data cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load uops retired that miss in the L1 data cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of load uops retired that hit in the L2 cache.",
         "Counter": "0,1,2,3,4,5",
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x2",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the number of load uops retired that miss in the L2 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd1",
+        "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of load uops retired that hit in the L3 cache.",
         "Counter": "0,1,2,3,4,5",
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.L3_HIT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x4",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the number of load uops retired that hit in the L3 cache, in which a snoop was required, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd2",
+        "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.HIT_E_F",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x40",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load uops retired that miss in the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd2",
+        "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.L3_MISS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of cycles that uops are blocked for any of the following reasons: load buffer, store buffer or RSV full.",
         "Counter": "0,1,2,3,4,5",
@@ -776,7 +859,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of load uops retired.",
         "SampleAfterValue": "200003",
         "UMask": "0x81",
@@ -788,7 +870,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.ALL_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of store uops retired.",
         "SampleAfterValue": "200003",
         "UMask": "0x82",
@@ -802,7 +883,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 128 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -816,7 +896,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 16 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -830,7 +909,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 256 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -844,7 +922,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 32 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -858,7 +935,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 4 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -872,7 +948,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 512 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -886,7 +961,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 64 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -900,7 +974,6 @@
         "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of tagged loads with an instruction latency that exceeds or equals the threshold of 8 cycles as defined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5",
@@ -912,7 +985,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x21",
         "Unit": "cpu_atom"
@@ -923,18 +995,46 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x41",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the total number of load and store uops retired that missed in the second level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x13",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of load ops retired that miss in the second Level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x11",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of store ops retired that miss in the second level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES",
+        "SampleAfterValue": "200003",
+        "UMask": "0x12",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of stores uops retired. Counts with or without PEBS enabled.",
         "Counter": "0,1,2,3,4,5",
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of stores uops retired. Counts with or without PEBS enabled. If PEBS is enabled and a PEBS record is generated, will populate PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x6",
@@ -950,13 +1050,57 @@
         "UMask": "0x3",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1F803C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, and modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HITM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data reads that were supplied by the L3 cache.",
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_DATA_RD.L3_HIT",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3F803C0001",
+        "MSRValue": "0x1F803C0001",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
@@ -1022,7 +1166,7 @@
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_RFO.L3_HIT",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3F803C0002",
+        "MSRValue": "0x1F803C0002",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
@@ -1049,6 +1193,72 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C0002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C0002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1F803C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HITM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "OFFCORE_REQUESTS.ALL_REQUESTS",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
index b4621c221f58..62fd70f220e5 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/floating-point.json
@@ -1,4 +1,13 @@
 [
+    {
+        "BriefDescription": "Counts the number of cycles the floating point divider is in the loop stage.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "ARITH.FPDIV_ACTIVE",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -9,6 +18,15 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of floating point divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts all microcode FP assists.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -187,7 +205,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "UOPS_RETIRED.FPDIV",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x8",
         "Unit": "cpu_atom"
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/frontend.json b/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
index 66735a612ebd..c5b3818ad479 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/frontend.json
@@ -55,7 +55,6 @@
         "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x1",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -68,7 +67,6 @@
         "EventName": "FRONTEND_RETIRED.DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x11",
-        "PEBS": "1",
         "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -81,7 +79,6 @@
         "EventName": "FRONTEND_RETIRED.ITLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x14",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -94,7 +91,6 @@
         "EventName": "FRONTEND_RETIRED.L1I_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x12",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -107,7 +103,6 @@
         "EventName": "FRONTEND_RETIRED.L2_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x13",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -120,7 +115,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600106",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -133,7 +127,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_128",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x608006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -146,7 +139,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_16",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x601006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -159,7 +151,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600206",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -172,7 +163,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_256",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x610006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -185,7 +175,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x100206",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -198,7 +187,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_32",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x602006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -211,7 +199,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_4",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600406",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -224,7 +211,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_512",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x620006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -237,7 +223,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_64",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x604006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -250,7 +235,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_8",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600806",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -263,7 +247,6 @@
         "EventName": "FRONTEND_RETIRED.MS_FLOWS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x8",
-        "PEBS": "1",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
         "Unit": "cpu_core"
@@ -275,7 +258,6 @@
         "EventName": "FRONTEND_RETIRED.STLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x15",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -288,7 +270,6 @@
         "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x17",
-        "PEBS": "1",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
         "Unit": "cpu_core"
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/memory.json b/tools/perf/pmu-events/arch/x86/alderlake/memory.json
index 81a03f53aadc..fa15f5797bed 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/memory.json
@@ -133,7 +133,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x400",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "53",
         "UMask": "0x1",
@@ -147,7 +146,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "1009",
         "UMask": "0x1",
@@ -161,7 +159,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "20011",
         "UMask": "0x1",
@@ -175,7 +172,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "503",
         "UMask": "0x1",
@@ -189,7 +185,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100007",
         "UMask": "0x1",
@@ -203,7 +198,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100003",
         "UMask": "0x1",
@@ -217,7 +211,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "101",
         "UMask": "0x1",
@@ -231,7 +224,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "2003",
         "UMask": "0x1",
@@ -245,7 +237,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "50021",
         "UMask": "0x1",
@@ -257,12 +248,22 @@
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE",
-        "PEBS": "2",
         "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8",
         "SampleAfterValue": "1000003",
         "UMask": "0x2",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F84400004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.",
         "Counter": "0,1,2,3,4,5",
@@ -329,6 +330,17 @@
         "UMask": "0x1",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F84404000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data read requests that miss the L3 cache.",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json b/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json
index b54a5fc0861f..855585fe6fae 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/metricgroups.json
@@ -41,6 +41,7 @@
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -89,7 +90,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -99,6 +102,7 @@
     "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category",
     "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category",
     "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category",
+    "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category",
     "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_bandwidth category",
     "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latency category",
     "tma_int_operations_group": "Metrics contributing to tma_int_operations category",
@@ -121,10 +125,13 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
     "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category",
+    "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category",
     "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
     "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
     "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
@@ -138,5 +145,6 @@
     "tma_retiring_group": "Metrics contributing to tma_retiring category",
     "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category",
     "tma_store_bound_group": "Metrics contributing to tma_store_bound category",
-    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category"
+    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category",
+    "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category"
 }
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/other.json b/tools/perf/pmu-events/arch/x86/alderlake/other.json
index f95e093f8fcf..a8b23e92408c 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/other.json
@@ -1,9 +1,10 @@
 [
     {
-        "BriefDescription": "ASSISTS.HARDWARE",
+        "BriefDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-events. the event also counts for Machine Ordering count.",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc1",
         "EventName": "ASSISTS.HARDWARE",
+        "PublicDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-events. This includes, but not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist ) , assists generated by ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists. the event also counts for Machine Ordering count.",
         "SampleAfterValue": "100003",
         "UMask": "0x4",
         "Unit": "cpu_core"
@@ -50,7 +51,6 @@
         "Deprecated": "1",
         "EventCode": "0xe4",
         "EventName": "LBR_INSERTS.ANY",
-        "PEBS": "1",
         "SampleAfterValue": "1000003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
@@ -66,6 +66,28 @@
         "UMask": "0x1",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784000004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data reads that have any type of response.",
         "Counter": "0,1,2,3,4,5",
@@ -88,6 +110,17 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_DATA_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784000001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data reads that were supplied by DRAM.",
         "Counter": "0,1,2,3",
@@ -121,6 +154,39 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784000002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts streaming stores which modify a full 64 byte cacheline that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x800000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts streaming stores which modify only part of a 64 byte cacheline that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x400000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts streaming stores that have any type of response.",
         "Counter": "0,1,2,3,4,5",
@@ -143,6 +209,28 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x14000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784004000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Cycles when Reservation Station (RS) is empty for the thread.",
         "Counter": "0,1,2,3,4,5,6,7",
diff --git a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
index b7656f77dee9..f5bf0816f190 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/pipeline.json
@@ -10,6 +10,16 @@
         "UMask": "0x9",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of cycles when any of the floating point or integer dividers are active.",
+        "Counter": "0,1,2,3,4,5",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Cycles when divide unit is busy executing divide or square root operations.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -21,6 +31,24 @@
         "UMask": "0x9",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of active floating point and integer dividers per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_OCCUPANCY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point and integer divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xc",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "This event is deprecated. Refer to new event ARITH.FPDIV_ACTIVE",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -32,6 +60,16 @@
         "UMask": "0x1",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of cycles any of the two integer dividers are active.",
+        "Counter": "0,1,2,3,4,5",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "This event counts the cycles the integer divider is busy.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -42,6 +80,24 @@
         "UMask": "0x8",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of active integer dividers per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_OCCUPANCY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of integer divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "This event is deprecated. Refer to new event ARITH.IDIV_ACTIVE",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -68,7 +124,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of instructions in which the instruction pointer (IP) of the processor is resteered due to a branch instruction and the branch instruction successfully retires. All branch type instructions are accounted for.",
         "SampleAfterValue": "200003",
         "Unit": "cpu_atom"
@@ -78,7 +133,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all branch instructions retired.",
         "SampleAfterValue": "400009",
         "Unit": "cpu_core"
@@ -89,7 +143,6 @@
         "Counter": "0,1,2,3,4,5",
         "Deprecated": "1",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xf9",
         "Unit": "cpu_atom"
@@ -99,7 +152,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x7e",
         "Unit": "cpu_atom"
@@ -109,7 +161,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11",
@@ -120,7 +171,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts not taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x10",
@@ -131,7 +181,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfe",
         "Unit": "cpu_atom"
@@ -141,7 +190,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x1",
@@ -152,7 +200,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.FAR_BRANCH",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xbf",
         "Unit": "cpu_atom"
@@ -162,7 +209,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.FAR_BRANCH",
-        "PEBS": "1",
         "PublicDescription": "Counts far branch instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x40",
@@ -173,7 +219,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xeb",
         "Unit": "cpu_atom"
@@ -183,7 +228,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts near indirect branch instructions retired excluding returns.
TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80", @@ -194,7 +238,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -205,7 +248,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.IND_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -216,7 +258,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e", "Unit": "cpu_atom" @@ -226,7 +267,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf9", "Unit": "cpu_atom" @@ -236,7 +276,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2", @@ -247,7 +286,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7", "Unit": "cpu_atom" @@ -257,7 +295,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8", @@ -268,7 +305,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xc0", "Unit": "cpu_atom" @@ -278,7 +314,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20", @@ -290,7 +325,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NON_RETURN_IND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb", "Unit": "cpu_atom" @@ -300,7 +334,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.REL_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfd", "Unit": "cpu_atom" @@ -311,7 +344,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7", "Unit": "cpu_atom" @@ -322,7 +354,6 @@ "Deprecated": "1", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.TAKEN_JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe", "Unit": "cpu_atom" @@ -332,7 +363,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. 
A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", "SampleAfterValue": "200003", "Unit": "cpu_atom" @@ -342,7 +372,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "400009", "Unit": "cpu_core" @@ -352,7 +381,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e", "Unit": "cpu_atom" @@ -362,7 +390,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "400009", "UMask": "0x11", @@ -373,7 +400,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "400009", "UMask": "0x10", @@ -384,7 +410,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe", "Unit": "cpu_atom" @@ -394,7 +419,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1", @@ -405,7 +429,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb", "Unit": "cpu_atom" @@ -415,7 +438,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. 
TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80", @@ -426,7 +448,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -436,7 +457,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "400009", "UMask": "0x2", @@ -448,7 +468,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.IND_CALL", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfb", "Unit": "cpu_atom" @@ -459,7 +478,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x7e", "Unit": "cpu_atom" @@ -469,7 +487,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x80", "Unit": "cpu_atom" @@ -479,7 +496,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "400009", "UMask": "0x20", @@ -491,7 +507,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NON_RETURN_IND", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xeb", "Unit": "cpu_atom" @@ -501,7 +516,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "100007", "UMask": "0x8", @@ -512,7 +526,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RETURN", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xf7", "Unit": "cpu_atom" @@ -523,7 +536,6 @@ "Deprecated": "1", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.TAKEN_JCC", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0xfe", "Unit": "cpu_atom" @@ -616,6 +628,16 @@ "UMask": "0x40", "Unit": "cpu_core" }, + { + "BriefDescription": "This event is deprecated. Refer to new event = CPU_CLK_UNHALTED.REF_TSC_P", + "Counter": "0,1,2,3,4,5", + "Deprecated": "1", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Core crystal clock cycles. Cycle counts are e= venly distributed between active threads in the Core.", "Counter": "0,1,2,3,4,5,6,7", @@ -854,7 +876,6 @@ "BriefDescription": "Counts the total number of instructions retir= ed. (Fixed event)", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions that= retired. For instructions that consist of multiple uops, this event counts= the retirement of the last uop of the instruction. This event continues co= unting during hardware interrupts, traps, and inside interrupt handlers. Th= is event uses fixed counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -864,7 +885,6 @@ "BriefDescription": "Number of instructions retired. 
Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -875,7 +895,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions that= retired. For instructions that consist of multiple uops, this event counts= the retirement of the last uop of the instruction. This event continues co= unting during hardware interrupts, traps, and inside interrupt handlers. Th= is event uses a programmable general purpose performance counter.", "SampleAfterValue": "2000003", "Unit": "cpu_atom" @@ -885,7 +904,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "Unit": "cpu_core" @@ -895,7 +913,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.MACRO_FUSED", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10", "Unit": "cpu_core" @@ -905,7 +922,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 instruc= tions", "SampleAfterValue": "2000003", "UMask": "0x2", @@ -915,7 +931,6 @@ "BriefDescription": "Precise instruction retired with PEBS precise= -distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a = precise distribution of samples across instructions retired. It utilizes th= e Precise Distribution of Instructions Retired (PDIR++) feature to fix bias= in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -926,7 +941,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", - "PEBS": "1", "PublicDescription": "Number of iterations of Repeat (REP) string = retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, a= nd doubleword version and string instructions can be repeated using a repet= ition prefix, REP, that allows their architectural execution to be repeated= a number of times as specified by the RCX register. 
Note the number of ite= rations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8", @@ -1065,7 +1079,6 @@ "Deprecated": "1", "EventCode": "0x03", "EventName": "LD_BLOCKS.4K_ALIAS", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4", "Unit": "cpu_atom" @@ -1075,7 +1088,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0x03", "EventName": "LD_BLOCKS.ADDRESS_ALIAS", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4", "Unit": "cpu_atom" @@ -1095,7 +1107,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0x03", "EventName": "LD_BLOCKS.DATA_UNKNOWN", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x1", "Unit": "cpu_atom" @@ -1244,7 +1255,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xe4", "EventName": "MISC_RETIRED.LBR_INSERTS", - "PEBS": "1", "PublicDescription": "Counts the number of LBR entries recorded. R= equires LBRs to be enabled in IA32_LBR_CTL. This event is PDIR on GP0 and N= PEBS on all other GPs [This event is alias to LBR_INSERTS.ANY]", "SampleAfterValue": "1000003", "UMask": "0x1", @@ -1551,7 +1561,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "TOPDOWN_RETIRING.ALL", - "PEBS": "1", "SampleAfterValue": "1000003", "Unit": "cpu_atom" }, @@ -1799,7 +1808,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.ALL", - "PEBS": "1", "SampleAfterValue": "2000003", "Unit": "cpu_atom" }, @@ -1829,7 +1837,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.IDIV", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10", "Unit": "cpu_atom" @@ -1839,7 +1846,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.MS", - "PEBS": "1", "PublicDescription": "Counts the number of uops that are from comp= lex flows issued by the Microcode Sequencer (MS). This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", "SampleAfterValue": "2000003", "UMask": "0x1", @@ -1895,7 +1901,6 @@ "Counter": "0,1,2,3,4,5", "EventCode": "0xc2", "EventName": "UOPS_RETIRED.X87", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x2", "Unit": "cpu_atom" diff --git a/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json index e0d8f3070778..132ce48af6d9 100644 --- a/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/alderlake/virtual-memory.json @@ -258,5 +258,38 @@ "SampleAfterValue": "1000003", "UMask": "0x90", "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_UOPS_RETIRED.STLB_MISS", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event MEM_UOPS_RETIRED.STLB_MISS_STORES", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "Deprecated": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index d503aa7e3594..d9538723927b 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -1,5 +1,5 @@ Family-model,Version,Filename,EventType -GenuineIntel-6-(97|9A|B7|BA|BF),v1.27,alderlake,core +GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core GenuineIntel-6-BE,v1.27,alderlaken,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v29,broadwell,core
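A note for readers of the new OCR.* entries: these are offcore-response events, that is, EventCode 0xB7 / UMask 0x1 qualified by a match value that perf programs into MSR 0x1a6/0x1a7 (the MSRValue field in the JSON). Below is a minimal sketch of that plumbing, assuming the usual x86 raw encoding in which the event/umask pair goes in perf_event_attr.config and the offcore-response match value in config1; on hybrid parts the cpu_atom PMU's dynamic type from sysfs would be used in place of PERF_TYPE_RAW:

/* Sketch only: count OCR.DEMAND_DATA_RD.DRAM as defined in this patch.
 * Assumes EventCode/UMask in attr.config and the offcore-response
 * MSRValue (0x784000001 for the DRAM variant) in attr.config1. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct perf_event_attr attr;
        long long count = 0;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;     /* hybrid: use cpu_atom's type from sysfs */
        attr.config = 0x1b7;           /* UMask 0x1 << 8 | EventCode 0xb7 */
        attr.config1 = 0x784000001ULL; /* MSRValue from the JSON entry */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        fd = syscall(SYS_perf_event_open, &attr, 0 /* self */, -1, -1, 0);
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... workload under measurement runs here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("OCR.DEMAND_DATA_RD.DRAM: %lld\n", count);
        close(fd);
        return 0;
}

In normal use none of this is written by hand; perf stat -e OCR.DEMAND_DATA_RD.DRAM should resolve the same encoding from these JSON files.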
-- 2.48.0.rc2.279.g1de40edade-goog From nobody Sun Dec 14 13:49:50 2025 Date: Wed, 15 Jan 2025 22:43:34 -0800 In-Reply-To: <20250116064355.344245-1-irogers@google.com> Message-Id: <20250116064355.344245-3-irogers@google.com> References: <20250116064355.344245-1-irogers@google.com> Subject: [PATCH v2 02/23] perf vendor events: Update AlderlakeN events/metrics From: Ian Rogers To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan Update events from v1.27 to v1.28. Update TMA metrics from 4.8 to 5.01.
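The TMA metric strings themselves are plain slot arithmetic over a 5-wide allocation window. As an illustration of the level-1 breakdown that the expressions below encode (the counter values here are placeholders, not measured data):

/* Level-1 TMA arithmetic from adln-metrics.json, e.g.
 *   tma_backend_bound = TOPDOWN_BE_BOUND.ALL / (5 * CPU_CLK_UNHALTED.CORE)
 * with tma_bad_speculation as the remainder of the four buckets. */
#include <stdio.h>

int main(void)
{
        double clks = 1.0e9;   /* CPU_CLK_UNHALTED.CORE (placeholder) */
        double fe   = 0.8e9;   /* TOPDOWN_FE_BOUND.ALL  (placeholder) */
        double be   = 1.5e9;   /* TOPDOWN_BE_BOUND.ALL  (placeholder) */
        double ret  = 2.0e9;   /* TOPDOWN_RETIRING.ALL  (placeholder) */
        double slots = 5.0 * clks;

        printf("frontend %.1f%%, backend %.1f%%, retiring %.1f%%, bad speculation %.1f%%\n",
               100.0 * fe / slots, 100.0 * be / slots, 100.0 * ret / slots,
               100.0 * (slots - (fe + be + ret)) / slots);
        return 0;
}

The MetricThreshold strings below then just compare these fractions against the quoted cut-offs, e.g. flagging tma_backend_bound when it exceeds 0.10.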
Bring in the event updates v1.28: https://github.com/intel/perfmon/commit/801f43f22ec6bd23fbb5d18860f395d61e7f4081 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782 Co-developed-by: Caleb Biggers Signed-off-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/alderlaken/adln-metrics.json | 83 ++++--- .../pmu-events/arch/x86/alderlaken/cache.json | 227 ++++++++++++++++-- .../arch/x86/alderlaken/floating-point.json | 17 +- .../arch/x86/alderlaken/memory.json | 20 ++ .../pmu-events/arch/x86/alderlaken/other.json | 81 ++++++- .../arch/x86/alderlaken/pipeline.json | 97 +++++--- .../arch/x86/alderlaken/virtual-memory.json | 30 +++ tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 8 files changed, 458 insertions(+), 99 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json index 447596f924ab..141562def5c4 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/adln-metrics.json @@ -48,13 +48,6 @@ "MetricName": "C7_Core_Residency", "ScaleUnit": "100%" }, - { - "BriefDescription": "C7 residency percent per package", - "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C7_Pkg_Residency", - "ScaleUnit": "100%" - }, { "BriefDescription": "C8 residency percent per package", "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", @@ -62,13 +55,6 @@ "MetricName": "C8_Pkg_Residency", "ScaleUnit": "100%" }, - { - "BriefDescription": "C9 residency percent per package", - "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", - "MetricGroup": "Power", - "MetricName": "C9_Pkg_Residency", - "ScaleUnit": "100%" - }, { "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", @@ -89,7 +75,7 @@ "MetricExpr": "tma_core_bound", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_core_bound >0.10) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -98,7 +84,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", + "MetricThreshold": "(tma_backend_bound >0.10)", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that were not consumed by the backend due to backend stalls. Note that uops must be available for consumption in order for this event to count.
If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%" @@ -109,7 +95,7 @@ "MetricExpr": "(5 * CPU_CLK_UNHALTED.CORE - (TOPDOWN_FE_BOUND.ALL = + TOPDOWN_BE_BOUND.ALL + TOPDOWN_RETIRING.ALL)) / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricThreshold": "(tma_bad_speculation >0.15)", "MetricgroupNoGroup": "TopdownL1;Default", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", "ScaleUnit": "100%" @@ -119,7 +105,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%" }, @@ -128,7 +114,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (5 * CPU_CLK_U= NHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", + "MetricThreshold": "(tma_branch_mispredicts >0.05) & ((tma_bad_spe= culation >0.15))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -137,7 +123,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (5 * CPU_CLK_UNHA= LTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -145,7 +131,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (5 * CPU_CLK_UNHALTED.CORE)= ", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_cisc >0.05) & ((tma_ifetch_bandwidth >0.1= 0) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -153,7 +139,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (5 * CPU_CLK_= UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", + "MetricThreshold": "(tma_core_bound >0.10) & ((tma_backend_bound >= 0.10))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -162,7 +148,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -170,7 +156,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (5 * CPU_CLK_UNH= ALTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", "ScaleUnit": "100%" }, { @@ -179,7 +165,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", + "MetricThreshold": "(tma_frontend_bound >0.20)", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%" }, @@ -188,7 +174,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (5 * CPU_CLK_UNHALTED.COR= E)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_icache_misses >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -196,7 +182,7 @@ "MetricExpr": 
"TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (5 * CPU_CLK_= UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -205,7 +191,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (5 * CPU_CLK_UN= HALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -320,6 +306,11 @@ "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_L2_HIT / MEM_BOUND_ST= ALLS.IFETCH", "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit" }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * (MEM_BOUND_STALLS.IFETCH_LLC_HIT + MEM_BOUND_= STALLS.IFETCH_DRAM_HIT) / MEM_BOUND_STALLS.IFETCH", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss" + }, { "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", "MetricExpr": "100 * MEM_BOUND_STALLS.IFETCH_LLC_HIT / MEM_BOUND_S= TALLS.IFETCH", @@ -336,6 +327,12 @@ "MetricGroup": "load_store_bound", "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit" }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses in the L2", + "MetricExpr": "100 * (MEM_BOUND_STALLS.LOAD_LLC_HIT + MEM_BOUND_ST= ALLS.LOAD_DRAM_HIT) / MEM_BOUND_STALLS.LOAD", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2mis= s" + }, { "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", "MetricExpr": "100 * MEM_BOUND_STALLS.LOAD_LLC_HIT / MEM_BOUND_STA= LLS.LOAD", @@ -472,6 +469,12 @@ "MetricGroup": "Summary", "MetricName": "tma_info_system_kernel_utilization" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_= mux < 0.9))" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.REF_TSC", @@ -503,7 +506,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.ITLB / (5 * CPU_CLK_UNHALTED.CORE)= ", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_itlb_misses >0.05) & ((tma_ifetch_latency= >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -511,7 +514,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (5 * CPU_C= LK_UNHALTED.CORE)", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_machine_clears", - "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", + "MetricThreshold": "(tma_machine_clears >0.05) & ((tma_bad_specula= tion >0.15))", 
"MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -520,7 +523,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -528,7 +531,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (5 * CPU_CLK_U= NHALTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -536,7 +539,7 @@ "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (5 * CPU_CLK_UNHALTE= D.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", + "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", "ScaleUnit": "100%" }, { @@ -544,7 +547,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (5 * CPU_CLK_UNHALTED.CORE= )", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -552,7 +555,7 @@ "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (5 * CPU_CLK_UNHALTED.= CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%" }, { @@ -560,7 +563,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (5 * CPU_CLK_UNHALTED.C= ORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -568,7 +571,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (5 * CPU_CLK_UNHA= LTED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" }, { @@ -576,7 +579,7 @@ "MetricExpr": "tma_backend_bound - tma_core_bound", "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", + "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -586,7 +589,7 @@ "MetricExpr": 
"TOPDOWN_RETIRING.ALL / (5 * CPU_CLK_UNHALTED.CORE)", "MetricGroup": "Default;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.75", + "MetricThreshold": "(tma_retiring >0.75)", "MetricgroupNoGroup": "TopdownL1;Default", "ScaleUnit": "100%" }, @@ -595,7 +598,7 @@ "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (5 * CPU_CLK_UNHAL= TED.CORE)", "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", "MetricName": "tma_serialization", - "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", "ScaleUnit": "100%" } ] diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/cache.json b/tools/p= erf/pmu-events/arch/x86/alderlaken/cache.json index 1500033ee19f..fd9ed58c2f90 100644 --- a/tools/perf/pmu-events/arch/x86/alderlaken/cache.json +++ b/tools/perf/pmu-events/arch/x86/alderlaken/cache.json @@ -1,4 +1,30 @@ [ + { + "BriefDescription": "Counts the total number of L2 Cache accesses.= Counts on a per core basis.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x24", + "EventName": "L2_REQUEST.ALL", + "PublicDescription": "Counts the total number of L2 Cache Accesses= , includes hits, misses, rejects front door requests for CRd/DRd/RFO/ItoM/= L2 Prefetches only. Counts on a per core basis.", + "SampleAfterValue": "200003" + }, + { + "BriefDescription": "Counts the number of L2 Cache accesses that r= esulted in a hit. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x24", + "EventName": "L2_REQUEST.HIT", + "PublicDescription": "Counts the number of L2 Cache accesses that = resulted in a hit from a front door request only (does not include rejects = or recycles), Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of L2 Cache accesses that r= esulted in a miss. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "PublicDescription": "Counts the number of L2 Cache accesses that = resulted in a miss from a front door request only (does not include rejects= or recycles). Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x1" + }, { "BriefDescription": "Counts the number of cacheable memory request= s that miss in the LLC. 
Counts on a per core basis.", "Counter": "0,1,2,3,4,5", @@ -92,30 +118,81 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_UOPS_RETIRED.DRAM_HIT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x80" }, + { + "BriefDescription": "Counts the number of load uops retired that h= it in the L3 cache, in which a snoop was required and modified data was for= warded from another core or module.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.HITM", + "SampleAfterValue": "200003", + "UMask": "0x20" + }, + { + "BriefDescription": "Counts the number of load uops retired that h= it in the L1 data cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the L1 data cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "200003", + "UMask": "0x8" + }, { "BriefDescription": "Counts the number of load uops retired that h= it in the L2 cache.", "Counter": "0,1,2,3,4,5", "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x2" }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the L2 cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "200003", + "UMask": "0x10" + }, { "BriefDescription": "Counts the number of load uops retired that h= it in the L3 cache.", "Counter": "0,1,2,3,4,5", "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_UOPS_RETIRED.L3_HIT", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x4" }, + { + "BriefDescription": "Counts the number of load uops retired that h= it in the L3 cache, in which a snoop was required, and non-modified data wa= s forwarded.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.HIT_E_F", + "SampleAfterValue": "1000003", + "UMask": "0x40" + }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the L3 cache.", + "Counter": "0,1,2,3,4,5", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_UOPS_RETIRED_MISC.L3_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x20" + }, { "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", "Counter": "0,1,2,3,4,5", @@ -154,7 +231,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts the total number of load uops retired= .", "SampleAfterValue": "200003", "UMask": "0x81" @@ -165,7 +241,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts the total number of store uops retire= d.", "SampleAfterValue": "200003", "UMask": "0x82" @@ -178,7 +253,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 128 cycles as def= ined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). 
Only counts with PEBS enabled.= If a PEBS record is generated, will populate the PEBS Latency and PEBS Dat= a Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -191,7 +265,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 16 cycles as defi= ned in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. = If a PEBS record is generated, will populate the PEBS Latency and PEBS Data= Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -204,7 +277,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 256 cycles as def= ined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled.= If a PEBS record is generated, will populate the PEBS Latency and PEBS Dat= a Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -217,7 +289,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 32 cycles as defi= ned in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. = If a PEBS record is generated, will populate the PEBS Latency and PEBS Data= Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -230,7 +301,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 4 cycles as defin= ed in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. I= f a PEBS record is generated, will populate the PEBS Latency and PEBS Data = Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -243,7 +313,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 512 cycles as def= ined in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled.= If a PEBS record is generated, will populate the PEBS Latency and PEBS Dat= a Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -256,7 +325,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 64 cycles as defi= ned in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. = If a PEBS record is generated, will populate the PEBS Latency and PEBS Data= Source fields accordingly.", "SampleAfterValue": "1000003", "UMask": "0x5" @@ -269,7 +337,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts the number of tagged loads with an in= struction latency that exceeds or equals the threshold of 8 cycles as defin= ed in MEC_CR_PEBS_LD_LAT_THRESHOLD (3F6H). Only counts with PEBS enabled. 
If a PEBS record is generated, will populate the PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x5"
     },
@@ -280,7 +347,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x21"
     },
@@ -290,28 +356,93 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x41"
     },
+    {
+        "BriefDescription": "Counts the total number of load and store uops retired that missed in the second level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x13"
+    },
+    {
+        "BriefDescription": "Counts the number of load ops retired that miss in the second Level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x11"
+    },
+    {
+        "BriefDescription": "Counts the number of store ops retired that miss in the second level TLB.",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES",
+        "SampleAfterValue": "200003",
+        "UMask": "0x12"
+    },
     {
         "BriefDescription": "Counts the number of stores uops retired. Counts with or without PEBS enabled.",
         "Counter": "0,1,2,3,4,5",
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY",
-        "PEBS": "2",
         "PublicDescription": "Counts the number of stores uops retired. Counts with or without PEBS enabled. If PEBS is enabled and a PEBS record is generated, will populate PEBS Latency and PEBS Data Source fields accordingly.",
         "SampleAfterValue": "1000003",
         "UMask": "0x6"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1F803C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, and modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HITM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C0004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand data reads that were supplied by the L3 cache.",
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_DATA_RD.L3_HIT",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3F803C0001",
+        "MSRValue": "0x1F803C0001",
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
@@ -351,7 +482,7 @@
         "EventCode": "0xB7",
         "EventName": "OCR.DEMAND_RFO.L3_HIT",
         "MSRIndex": "0x1a6,0x1a7",
-        "MSRValue": "0x3F803C0002",
+        "MSRValue": "0x1F803C0002",
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
@@ -365,6 +496,66 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C0002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C0002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x1F803C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HITM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, but no data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_NO_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x4003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by the L3 cache where a snoop was sent, the snoop hit, and non-modified data was forwarded.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x8003C4000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts the number of issue slots every cycle that were not delivered by the frontend due to instruction cache misses.",
         "Counter": "0,1,2,3,4,5",
diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json b/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json
index 484d8b3167f0..ed963fcb6485 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/floating-point.json
@@ -1,4 +1,20 @@
 [
+    {
+        "BriefDescription": "Counts the number of cycles the floating point divider is in the loop stage.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8"
+    },
     {
         "BriefDescription": "Counts the number of floating point operations retired that required microcode assist.",
         "Counter": "0,1,2,3,4,5",
@@ -13,7 +29,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "UOPS_RETIRED.FPDIV",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x8"
     }
diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/memory.json b/tools/perf/pmu-events/arch/x86/alderlaken/memory.json
index 619488d42a4a..3b46b048dfb2 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/memory.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/memory.json
@@ -56,6 +56,16 @@
         "SampleAfterValue": "20003",
         "UMask": "0x2"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F84400004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand data reads that were not supplied by the L3 cache.",
         "Counter": "0,1,2,3,4,5",
@@ -95,5 +105,15 @@
         "MSRValue": "0x3F84400002",
         "SampleAfterValue": "100003",
         "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were not supplied by the L3 cache.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.L3_MISS",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x3F84404000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
     }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/other.json b/tools/perf/pmu-events/arch/x86/alderlaken/other.json
index 54ddbe2b3b9b..f8c21b7f8f40 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/other.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/other.json
@@ -5,7 +5,6 @@
         "Deprecated": "1",
         "EventCode": "0xe4",
         "EventName": "LBR_INSERTS.ANY",
-        "PEBS": "1",
         "SampleAfterValue": "1000003",
         "UMask": "0x1"
     },
@@ -19,6 +18,26 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784000004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand data reads that have any type of response.",
         "Counter": "0,1,2,3,4,5",
@@ -29,6 +48,16 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand data reads that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_DATA_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784000001",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.",
         "Counter": "0,1,2,3,4,5",
@@ -39,6 +68,36 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts demand reads for ownership (RFO) and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_RFO.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784000002",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts streaming stores which modify a full 64 byte cacheline that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x800000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts streaming stores which modify only part of a 64 byte cacheline that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x400000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts streaming stores that have any type of response.",
         "Counter": "0,1,2,3,4,5",
@@ -49,6 +108,26 @@
         "SampleAfterValue": "100003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that have any type of response.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x14000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts L1 data cache software prefetches which include T0/T1/T2 and NTA (except PREFETCHW) that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xB7",
+        "EventName": "OCR.SWPF_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x784004000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state. For Tremont, UMWAIT and TPAUSE will only put the CPU into C0.1 activity state (not C0.2 activity state)",
         "Counter": "0,1,2,3,4,5",
diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json b/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json
index f05db45578ff..713ebc21cec0 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/pipeline.json
@@ -1,10 +1,59 @@
 [
+    {
+        "BriefDescription": "Counts the number of cycles when any of the floating point or integer dividers are active.",
+        "Counter": "0,1,2,3,4,5",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3"
+    },
+    {
+        "BriefDescription": "Counts the number of active floating point and integer dividers per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_OCCUPANCY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point and integer divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.DIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xc"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles any of the two integer dividers are active.",
+        "Counter": "0,1,2,3,4,5",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts the number of active integer dividers per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_OCCUPANCY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1"
+    },
+    {
+        "BriefDescription": "Counts the number of integer divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.IDIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4"
+    },
     {
         "BriefDescription": "Counts the total number of branch instructions retired for all branch types.",
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of instructions in which the instruction pointer (IP) of the processor is resteered due to a branch instruction and the branch instruction successfully retires. All branch type instructions are accounted for.",
         "SampleAfterValue": "200003"
     },
@@ -14,7 +63,6 @@
         "Deprecated": "1",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xf9"
     },
@@ -23,7 +71,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x7e"
     },
@@ -32,7 +79,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfe"
     },
@@ -41,7 +87,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.FAR_BRANCH",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xbf"
     },
@@ -50,7 +95,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xeb"
     },
@@ -59,7 +103,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT_CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfb"
     },
@@ -69,7 +112,6 @@
         "Deprecated": "1",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.IND_CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfb"
     },
@@ -79,7 +121,6 @@
         "Deprecated": "1",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.JCC",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x7e"
     },
@@ -88,7 +129,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xf9"
     },
@@ -97,7 +137,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_RETURN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xf7"
     },
@@ -106,7 +145,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xc0"
     },
@@ -116,7 +154,6 @@
         "Deprecated": "1",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NON_RETURN_IND",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xeb"
     },
@@ -125,7 +162,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.REL_CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfd"
     },
@@ -135,7 +171,6 @@
         "Deprecated": "1",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.RETURN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xf7"
     },
@@ -145,7 +180,6 @@
         "Deprecated": "1",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.TAKEN_JCC",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfe"
     },
@@ -154,7 +188,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of mispredicted branch instructions retired. All branch type instructions are accounted for. Prediction of the branch target address enables the processor to begin executing instructions before the non-speculative execution path is known. The branch prediction unit (BPU) predicts the target address based on the instruction pointer (IP) of the branch and on the execution path through which execution reached this IP. A branch misprediction occurs when the prediction is wrong, and results in discarding all instructions executed in the speculative path and re-fetching from the correct path.",
         "SampleAfterValue": "200003"
     },
@@ -163,7 +196,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x7e"
     },
@@ -172,7 +204,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfe"
     },
@@ -181,7 +212,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xeb"
     },
@@ -190,7 +220,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT_CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfb"
     },
@@ -200,7 +229,6 @@
         "Deprecated": "1",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.IND_CALL",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfb"
     },
@@ -210,7 +238,6 @@
         "Deprecated": "1",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.JCC",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x7e"
     },
@@ -219,7 +246,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0x80"
     },
@@ -229,7 +255,6 @@
         "Deprecated": "1",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.NON_RETURN_IND",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xeb"
     },
@@ -238,7 +263,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.RETURN",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xf7"
     },
@@ -248,7 +272,6 @@
         "Deprecated": "1",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.TAKEN_JCC",
-        "PEBS": "1",
         "SampleAfterValue": "200003",
         "UMask": "0xfe"
     },
@@ -268,6 +291,15 @@
         "PublicDescription": "Counts the number of core cycles while the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. The core frequency may change from time to time. For this reason this event may have a changing ratio with regards to time. This event uses a programmable general purpose performance counter.",
         "SampleAfterValue": "2000003"
     },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event CPU_CLK_UNHALTED.REF_TSC_P",
+        "Counter": "0,1,2,3,4,5",
+        "Deprecated": "1",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.REF",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Counts the number of unhalted reference clock cycles at TSC frequency. (Fixed event)",
         "Counter": "Fixed counter 2",
@@ -305,7 +337,6 @@
         "BriefDescription": "Counts the total number of instructions retired. (Fixed event)",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of instructions that retired. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. This event continues counting during hardware interrupts, traps, and inside interrupt handlers. This event uses fixed counter 0.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
@@ -315,7 +346,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.ANY_P",
-        "PEBS": "1",
         "PublicDescription": "Counts the total number of instructions that retired. For instructions that consist of multiple uops, this event counts the retirement of the last uop of the instruction. This event continues counting during hardware interrupts, traps, and inside interrupt handlers. This event uses a programmable general purpose performance counter.",
         "SampleAfterValue": "2000003"
     },
@@ -325,7 +355,6 @@
         "Deprecated": "1",
         "EventCode": "0x03",
         "EventName": "LD_BLOCKS.4K_ALIAS",
-        "PEBS": "1",
         "SampleAfterValue": "1000003",
         "UMask": "0x4"
     },
@@ -334,7 +363,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0x03",
         "EventName": "LD_BLOCKS.ADDRESS_ALIAS",
-        "PEBS": "1",
         "SampleAfterValue": "1000003",
         "UMask": "0x4"
     },
@@ -343,7 +371,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0x03",
         "EventName": "LD_BLOCKS.DATA_UNKNOWN",
-        "PEBS": "1",
         "SampleAfterValue": "1000003",
         "UMask": "0x1"
     },
@@ -392,7 +419,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xe4",
         "EventName": "MISC_RETIRED.LBR_INSERTS",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of LBR entries recorded. Requires LBRs to be enabled in IA32_LBR_CTL. This event is PDIR on GP0 and NPEBS on all other GPs [This event is alias to LBR_INSERTS.ANY]",
         "SampleAfterValue": "1000003",
         "UMask": "0x1"
@@ -588,7 +614,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "TOPDOWN_RETIRING.ALL",
-        "PEBS": "1",
         "SampleAfterValue": "1000003"
     },
     {
@@ -604,7 +629,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "UOPS_RETIRED.ALL",
-        "PEBS": "1",
         "SampleAfterValue": "2000003"
     },
     {
@@ -612,7 +636,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "UOPS_RETIRED.IDIV",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x10"
     },
@@ -621,7 +644,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "UOPS_RETIRED.MS",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of uops that are from complex flows issued by the Microcode Sequencer (MS). This includes uops from flows due to complex instructions, faults, assists, and inserted flows.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
@@ -631,7 +653,6 @@
         "Counter": "0,1,2,3,4,5",
         "EventCode": "0xc2",
         "EventName": "UOPS_RETIRED.X87",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
     }
diff --git a/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json b/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json
index ad2b1349bab4..d9c737a17df0 100644
--- a/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json
+++ b/tools/perf/pmu-events/arch/x86/alderlaken/virtual-memory.json
@@ -49,5 +49,35 @@
         "EventName": "LD_HEAD.DTLB_MISS_AT_RET",
         "SampleAfterValue": "1000003",
         "UMask": "0x90"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event MEM_UOPS_RETIRED.STLB_MISS",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "Deprecated": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.DTLB_MISS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x13"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event MEM_UOPS_RETIRED.STLB_MISS_LOADS",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "Deprecated": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_LOADS",
+        "SampleAfterValue": "200003",
+        "UMask": "0x11"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event MEM_UOPS_RETIRED.STLB_MISS_STORES",
+        "Counter": "0,1,2,3,4,5",
+        "Data_LA": "1",
+        "Deprecated": "1",
+        "EventCode": "0xd0",
+        "EventName": "MEM_UOPS_RETIRED.DTLB_MISS_STORES",
+        "SampleAfterValue": "200003",
+        "UMask": "0x12"
     }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index d9538723927b..2fa5529ee585 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -1,6 +1,6 @@
 Family-model,Version,Filename,EventType
 GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core
-GenuineIntel-6-BE,v1.27,alderlaken,core
+GenuineIntel-6-BE,v1.28,alderlaken,core
 GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
 GenuineIntel-6-(3D|47),v29,broadwell,core
 GenuineIntel-6-56,v11,broadwellde,core
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:35 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-4-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 03/23] perf vendor events: Add Arrowlake events/metrics
From: Ian Rogers

Add events v1.06. Add TMA metrics based on v5.01.
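An aside for readers of the metric expressions that follow: every level-1
TMA metric in this file normalizes one topdown slot counter by the sum of
all four, as in the tma_backend_bound MetricExpr
"topdown\-be\-bound / (topdown\-fe\-bound + topdown\-bad\-spec +
topdown\-retiring + topdown\-be\-bound)". A minimal Python sketch of that
arithmetic follows; counter names and values here are hypothetical, and
perf evaluates the JSON expressions itself:

def tma_level1(fe_bound, bad_spec, retiring, be_bound):
    """Normalize the four topdown slot counts into fractions of all slots."""
    total_slots = fe_bound + bad_spec + retiring + be_bound
    return {
        "tma_frontend_bound": fe_bound / total_slots,
        "tma_bad_speculation": bad_spec / total_slots,
        "tma_retiring": retiring / total_slots,
        "tma_backend_bound": be_bound / total_slots,
    }

# Example: a backend-bound workload; the MetricThresholds in this file
# flag tma_backend_bound once its share of slots exceeds 0.2.
print(tma_level1(fe_bound=1e9, bad_spec=5e8, retiring=2e9, be_bound=3e9))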
Bring in the events from:
https://github.com/intel/perfmon/tree/main/ARL/events

TMA 5.01 is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/arrowlake/arl-metrics.json       | 4597 +++++++++++++++++
 .../pmu-events/arch/x86/arrowlake/cache.json  | 1471 ++++++
 .../arch/x86/arrowlake/floating-point.json    |  532 ++
 .../arch/x86/arrowlake/frontend.json          |  609 +++
 .../pmu-events/arch/x86/arrowlake/memory.json |  387 ++
 .../arch/x86/arrowlake/metricgroups.json      |  150 +
 .../pmu-events/arch/x86/arrowlake/other.json  |  279 +
 .../arch/x86/arrowlake/pipeline.json          | 2307 +++++++++
 .../arch/x86/arrowlake/uncore-cache.json      |   20 +
 .../x86/arrowlake/uncore-interconnect.json    |   47 +
 .../arch/x86/arrowlake/uncore-memory.json     |  142 +
 .../arch/x86/arrowlake/uncore-other.json      |   10 +
 .../arch/x86/arrowlake/virtual-memory.json    |  522 ++
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    1 +
 14 files changed, 11074 insertions(+)
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/cache.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/floating-point.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/frontend.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/memory.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/metricgroups.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/other.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-interconnect.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/uncore-other.json
 create mode 100644 tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory.json
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json b/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json
new file mode 100644
index 000000000000..81979427de5c
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/arl-metrics.json
@@ -0,0 +1,4597 @@
+[
+    {
+        "BriefDescription": "C10 residency percent per package",
+        "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C10_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C1 residency percent per core",
+        "MetricExpr": "cstate_core@c1\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C1_Core_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C2 residency percent per package",
+        "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C2_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C3 residency percent per package",
+        "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C3_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C6 residency percent per core",
+        "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C6_Core_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C6 residency percent per package",
+        "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
"MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" + }, + { + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_allocation_restriction", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_mi= crocode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. 
Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;Memo= ryTLB;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_= split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_gro= up;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (t= ma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_res= teers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispr= edicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_bra= nches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_= ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / = (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;Top= downL1;tma_L1_group", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@U= OPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + = tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma= _other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + = tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itl= b_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switch= es) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms= )) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / t= ma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RES= OURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_s= erializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (= tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_m= icrocode_sequencer) * 
tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tm= a_L1_group;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slot= s;TopdownL1;tma_L1_group;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instr= uctions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_detect", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. 
Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (8 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_resteer", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_overlap", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. 
Related metrics: tma_l3_hit_latency, tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c01_wait", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c02_wait", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_grou= p;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEN= D_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRE= D.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTE= ND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETI= RED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / 
tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.= WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP= _RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_bwd_mispredicts", + "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= ", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_fwd_mispredicts", + "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@= MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (2= 7 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) i= f 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * cpu_core@MEM_L= OAD_L3_HIT_RETIRED.XSNP_HITM@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * t= ma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 <= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R else MEM_LOAD_L3_HIT_RETIRED.= XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_cor= e_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= ) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound 
>= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_= L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_contested_accesses, tma= _machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_= group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIV_ACTIVE", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(cpu_core@IDQ.DSB_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x= 1@ + IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_= UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks", + "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma= _fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. 
The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivers higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties; hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_dtlb_load", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (8 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints at approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers a suboptimal amount of uops to the Backend. Rel= ated metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_ipt= b, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_grou= p;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. 
For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that are decoded into two or more = uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_= group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that are decoded into two or more= uops. This highly-correlates with the number of uops in such instructions", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handling Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handling Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0= x30@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. 
Frontend denotes t= he first part of the processor core responsible for fetching operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions, where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions, where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations, instructions that require = two or more uops or micro-coded sequences", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations, instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences. ([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions caused by d= ecode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_bandwidth", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions caused by ica= che misses, itlb misses, branch detection, and resteer limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (8 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_latency", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipflop", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX 256-bit in= struction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / 
(cpu_atom@FP_INST_RETI= RED.256B_DP@ + cpu_atom@FP_INST_RETIRED.256B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx256", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / 
BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts;Metric", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_a= tom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Ifetch", + "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", + "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. 
See Info.Ifetch_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_ato= m@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Load_Store_Miss", + "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", + "MetricGroup": "Mem_Exec", + "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction per (near) call (lower number mea= ns higher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricName": "tma_info_br_inst_mix_ipcall", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", + "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired return Branch Mispre= diction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Branch Misprediction= ", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipmispredict", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between 
Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are backward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_bwd", + "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_fwd", + "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL)= / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + t= ma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / 
tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREA= D\\,cmask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / cpu_atom@INST_RETI= RED.ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MIT= E_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\= \=3D0x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + 
"Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_MISS@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_MISS@ - = cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@) / cpu_atom@MEM_BOUND_STALLS_IFET= CH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. 
Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;Inst_Metric;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Inst_Metric;Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_= issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_MISS@ / cpu= _atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / cpu= _atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses the L3", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_LOAD.L2_MISS@ - cp= u_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@) / cpu_atom@MEM_BOUND_STALLS_LOAD.ALL= @", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_l1_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_load_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", + "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_store_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory disambiguation", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu= _atom@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_= pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the 
number of machine clears relative = to thousands of instructions retired, due to memory ordering", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cp= u_atom@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_= pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory renaming", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@= INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to page faults", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_ato= m@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fa= ult_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@= MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@M= EM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", + "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_= HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", + "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD= _HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom= @LD_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_ipload", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", + "MetricName": "tma_info_mem_mix_ipstore", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more 
locks", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_locks_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_splits_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of mem load uops to all uops", + "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@TOPDOWN_RETIRING.ALL@", + "MetricName": "tma_info_mem_mix_memload_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", + "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_tim= e", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1dl0_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": 
"1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Metric;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_c= ore@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD", + "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one 
+        "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES",
+        "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric",
+        "MetricName": "tma_info_memory_mlp",
+        "PublicDescription": "Memory-Level-Parallelism (average number of L1 miss demand load when there is at least one such miss. Per-Logical Processor)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Metric;Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Fed;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_code_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to STLB misses by demand loads",
+        "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INST_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret",
+        "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_load_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses",
+        "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)",
+        "MetricGroup": "Core_Metric;Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_page_walks_utilization",
+        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to STLB misses by demand stores",
+        "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_INST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret",
+        "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) data store speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_store_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from DSB per cycle",
+        "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_dsb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from LSD per cycle",
+        "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_lsd",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from MITE per cycle",
+        "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_mite",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per a microcode Assist invocation",
+        "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
+        "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_pipeline_ipassist",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
+        "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_retire",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions",
+        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;MicroSeq;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_strings_cycles",
+        "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction",
+        "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricName": "tma_info_serialization _%_tpause_cycles",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles the processor is waiting yet unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 power-performance optimized states",
+        "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Metric",
+        "MetricName": "tma_info_system_c0_wait",
+        "MetricThreshold": "tma_info_system_c0_wait > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
+        "MetricGroup": "Power;Summary;System_Metric",
+        "MetricName": "tma_info_system_core_frequency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average CPU Utilization (percentage)",
+        "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online",
+        "MetricGroup": "HPC;Metric;Summary",
+        "MetricName": "tma_info_system_cpu_utilization",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of utilized CPUs",
+        "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC",
+        "MetricGroup": "Metric;Summary",
+        "MetricName": "tma_info_system_cpus_utilized",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Giga Floating Point Operations Per Second",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
+        "MetricGroup": "Flops",
+        "MetricName": "tma_info_system_gflops",
+        "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per Far Branch ( Far Branches apply upon transition from application to operating system, handling interrupts, exceptions) [lower number means higher occurrence rate]",
+        "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
+        "MetricGroup": "Branches;Inst_Metric;OS",
+        "MetricName": "tma_info_system_ipfarbranch",
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
+        "MetricGroup": "Metric;OS",
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_kernel_utilization",
+        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Clocks;Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC;System_Metric",
+        "MetricName": "tma_info_system_power",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Socket actual clocks when any core is active on that socket",
+        "MetricExpr": "UNC_CLOCK.SOCKET",
+        "MetricGroup": "Count;SoC",
+        "MetricName": "tma_info_system_socket_clks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Seconds;Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average Frequency Utilization relative nominal frequency",
+        "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
+        "MetricGroup": "Power",
+        "MetricName": "tma_info_system_turbo_utilization",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Count;Pipeline",
+        "MetricName": "tma_info_thread_clks",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
+        "MetricExpr": "1 / tma_info_thread_ipc",
+        "MetricGroup": "Mem;Metric;Pipeline",
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "The ratio of Executed- by Issued-Uops",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
+        "MetricGroup": "Cor;Metric;Pipeline",
+        "MetricName": "tma_info_thread_execute_per_issue",
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
+        "MetricGroup": "Metric;Ret;Summary",
+        "MetricName": "tma_info_thread_ipc",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
+        "MetricExpr": "slots",
+        "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots",
+        "MetricgroupNoGroup": "TopdownL1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops Per Instruction",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
+        "MetricGroup": "Metric;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_thread_uoppi",
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops per taken branch",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW;Metric",
+        "MetricName": "tma_info_thread_uptb",
+        "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are FPDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are IDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_idiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are microcode ops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_microcode_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are x87 uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_x87_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD= _RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_L= OAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thr= ead_clks", + "MetricGroup": "BvML;Clocks_Retired;MemoryLat;TopdownL4;tma_L4_gro= up;tma_l1_bound_group", + "MetricName": "tma_l1_latency_capacity", + "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound >= 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_l1_bound_group", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: DEPENDENT_LOADS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks", + "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;Topdown= L3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tm= a_l2_bound_group", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+        "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l3_bound",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RETIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED.L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group;tma_overlap",
+        "MetricName": "tma_l3_hit_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_branch_resteers, tma_mem_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)",
+        "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
+        "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_lcp",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_hit",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_miss",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
+        "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks",
+        "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricName": "tma_lock_latency",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit",
+        "MetricExpr": "cpu_core@LSD.UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_lsd",
+        "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+        "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
+        "MetricName": "tma_machine_clears",
+        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricName": "tma_mem_bandwidth",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_sq_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
+        "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
+        "MetricName": "tma_mem_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_mem_scheduler",
+        "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_memory_bound",
+        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_memory_fence",
+        "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
+        "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_memory_operations",
+        "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit",
+        "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots",
+        "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
+        "MetricName": "tma_microcode_sequencer",
+        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage",
+        "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
+        "MetricName": "tma_mispredicts_resteers",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)",
+        "MetricExpr": "(cpu_core@IDQ.MITE_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_mite",
+        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)",
+        "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
+        "MetricName": "tma_mixing_vectors",
+        "MetricThreshold": "tma_mixing_vectors > 0.05",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks",
+        "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)",
+        "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
+        "MetricName": "tma_ms_switches",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused",
+        "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_non_fused_branches",
+        "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_non_mem_scheduler",
+        "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_other_light_ops_group",
+        "MetricName": "tma_nop_instructions",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_nuke",
+        "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05) & ((tma_bad_speculation >0.15)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_other_fb",
+        "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes",
+        "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * tma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128 + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches))",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_other_light_ops",
+        "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+        "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_ports_utilization",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_0",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_1",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_2",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
+        "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_3m",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_predecode",
+        "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_register",
+        "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_reorder_buffer",
+        "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to a resource limitation",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@) - tma_core_bound",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_resource_bound",
+        "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bound >0.10))",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by (indirect) RET instructions",
+        "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED.RET_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ret_mispredicts",
+        "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
"tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma= _info_thread_clks", + "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_c= ore_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_othe= r_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", + "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bo= und_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", + "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueS= pSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_cl= ks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * tma_info_thread_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Uncore frequency per die [GHZ]",
+        "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_time / 1e9",
+        "MetricGroup": "SoC",
+        "MetricName": "UNCORE_FREQ",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
+        "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_alu_op_utilization",
+        "MetricThreshold": "tma_alu_op_utilization > 0.4",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists",
+        "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots",
+        "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_assists",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists",
+        "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_avx_assists",
+        "MetricThreshold": "tma_avx_assists > 0.1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_backend_bound",
+        "MetricThreshold": "tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bad_speculation",
+        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_latency_capacity / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_split_stores / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "(tma_bottleneck_cache_memory_latency > 20)",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_latency_capacity + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "(tma_bottleneck_memory_data_tlbs > 20)",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "(tma_bottleneck_memory_synchronization > 10)",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "(tma_bottleneck_other_bottlenecks > 20)",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls.",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRED.L2_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by non-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_nt_mispredicts",
+        "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by backward-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_tk_bwd_mispredicts",
+        "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by forward-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_tk_fwd_mispredicts",
+        "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_contested_accesses",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_machine_clears",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_data_sharing",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_machine_clears",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_divider",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
+        "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_dram_bound",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
+        "MetricExpr": "(cpu_core@IDQ.DSB_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ + IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks",
+        "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_dsb",
+        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_dsb_switches",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
+        "MetricName": "tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_dtlb_store",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
+        "MetricName": "tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
+        "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricName": "tma_false_sharing",
+        "MetricThreshold": "(tma_false_sharing > 0.05) & ((tma_store_bound > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
+        "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricName": "tma_fb_full",
+        "MetricThreshold": "tma_fb_full > 0.3",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
+        "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
+        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
+        "MetricName": "tma_fetch_bandwidth",
+        "MetricThreshold": "tma_fetch_bandwidth > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATENCY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+        "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_fetch_latency",
+        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period.
Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0= x30@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. 
Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_tk_bwd", + "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_tk_fwd", + "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL)= / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + t= ma_info_branches_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_other_branches", + "Unit": 
"cpu_core" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREA= D\\,cmask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MIT= E_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\= \=3D0x1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + 
"MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). 
Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + 
"MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", + "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_tim= e", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1dl0_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * 
L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_c= ore@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES", + "MetricGroup": 
"Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_dsb", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_core" + }, + { + 
"BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_mite", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.= SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_system_core_frequency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Socket actual clocks when any core is active = on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "SoC", + "MetricName": "tma_info_system_socket_clks", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_core" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "slots", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_t= hread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit Level 1 after missing Level 0 within the L1D cache",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_LOAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_capacity",
+        "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: DEPENDENT_LOADS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
+        "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks",
+        "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l2_bound",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
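Several latency-cost nodes in this file (tma_l1_latency_capacity above, tma_l2_hit_latency and tma_split_loads below) share one pattern: multiply the hit count by its measured mean retirement latency (the cpu_core@EVENT@R term) when that is available, cap the product at a fixed per-hit cost that also serves as the fallback, then normalize by thread clocks. A hedged Python sketch of the pattern, with hypothetical inputs:

  def latency_cost(hits, mean_latency, per_hit_cap, clks):
      # Mirrors the guarded min(...) pattern in the MetricExpr fields above:
      # use the sampled mean latency (the @R term) when non-zero, never
      # exceeding the fixed per-hit cost, which is also the fallback.
      if mean_latency > 0:
          cycles = min(hits * mean_latency, hits * per_hit_cap)
      else:
          cycles = hits * per_hit_cap
      return cycles / clks

  # Hypothetical: 50M L1 hits after an L0 miss, 7.2 cycles mean latency,
  # 9-cycle cap (as in tma_l1_latency_capacity), 5G thread clocks.
  print(latency_cost(50e6, 7.2, 9, 5e9))  # ~0.072, i.e. 7.2% of cycles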
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RETIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core",
+        "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l3_bound",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RETIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED.L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
+        "MetricName": "tma_l3_hit_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_branch_resteers, tma_mem_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs)",
+        "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_lcp",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.LOAD",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_hit",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_miss",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
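The three tma_load_stlb_miss_{1g,2m,4k} children above simply apportion the parent tma_load_stlb_miss cost by the mix of completed page walks per page size, so they sum back to the parent. A small Python sketch (not part of the patch, counts hypothetical):

  # Apportion the STLB-miss walk cost by completed-walk page size, as the
  # three child metrics above do. Walk counts are hypothetical.
  walks = {"4K": 9.0e6, "2M_4M": 0.8e6, "1G": 0.2e6}  # DTLB_LOAD_MISSES.WALK_COMPLETED_*
  tma_load_stlb_miss = 0.06                           # assumed parent value (6% of cycles)
  total_walks = sum(walks.values())
  children = {size: tma_load_stlb_miss * n / total_walks for size, n in walks.items()}
  assert abs(sum(children.values()) - tma_load_stlb_miss) < 1e-12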
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
+        "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricName": "tma_lock_latency",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit",
+        "MetricExpr": "cpu_core@LSD.UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_lsd",
+        "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+        "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
+        "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
+        "MetricName": "tma_machine_clears",
+        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricName": "tma_mem_bandwidth",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_sq_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
+        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
+        "MetricName": "tma_mem_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_memory_bound",
+        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_memory_fence",
+        "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
+        "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_memory_operations",
+        "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit",
+        "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots",
+        "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
+        "MetricName": "tma_microcode_sequencer",
+        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
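Worth noting how tma_mem_bandwidth and tma_mem_latency above partition one occupancy event: cycles with at least four outstanding demand data reads (the cmask\=0x4 term) are charged to bandwidth, and the remaining cycles with any outstanding read to latency. A sketch of that split (not part of the patch, cycle counts hypothetical):

  # Split DRAM stalls into bandwidth- vs latency-dominated cycles, as the
  # two metrics above do. Cycle counts are hypothetical.
  clks = 5.0e9
  cycles_ge_1_rd = 2.0e9  # OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD
  cycles_ge_4_rd = 0.8e9  # same event restricted with cmask=0x4 (>= 4 outstanding)

  tma_mem_bandwidth = min(clks, cycles_ge_4_rd) / clks                    # 0.16
  tma_mem_latency = min(clks, cycles_ge_1_rd) / clks - tma_mem_bandwidth  # 0.24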
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage",
+        "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
+        "MetricName": "tma_mispredicts_resteers",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)",
+        "MetricExpr": "(cpu_core@IDQ.MITE_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_mite",
+        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)",
+        "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
+        "MetricName": "tma_mixing_vectors",
+        "MetricThreshold": "tma_mixing_vectors > 0.05",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks",
+        "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)",
+        "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
+        "MetricName": "tma_ms_switches",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused",
+        "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHES - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_non_fused_branches",
+        "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
+        "MetricName": "tma_nop_instructions",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes",
+        "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * tma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128 + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches))",
+        "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_other_light_ops",
+        "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+        "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+        "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_ports_utilization",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_0",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_1",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
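The four tma_ports_utilized_* nodes (the first two above, the remaining two just below) bucket cycles by how many uops executed; each is simply its EXE_ACTIVITY / UOPS_EXECUTED cycle count over thread clocks. A minimal sketch (hypothetical counts):

  # Bucket cycles by executed-uop count, as tma_ports_utilized_{0,1,2,3m} do.
  # All cycle counts are hypothetical.
  clks = 5.0e9
  buckets = {
      "0":  0.6e9 / clks,  # EXE_ACTIVITY.EXE_BOUND_0_PORTS
      "1":  1.0e9 / clks,  # EXE_ACTIVITY.1_PORTS_UTIL
      "2":  1.1e9 / clks,  # EXE_ACTIVITY.2_PORTS_UTIL
      "3m": 1.5e9 / clks,  # UOPS_EXECUTED.CYCLES_GE_3
  }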
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_2",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
+        "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_3m",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by (indirect) RET instructions",
+        "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED.RET_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ret_mispredicts",
+        "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_retiring",
+        "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations",
+        "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma_info_thread_clks",
+        "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO",
+        "MetricName": "tma_serializing_operation",
+        "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer)",
+        "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
+        "MetricName": "tma_shuffles_256b",
+        "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer). Shuffles may incur slow cross \"vector lane\" data transfers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_slow_pause",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary",
+        "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_split_loads",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
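The tma_retiring expression above is the standard fixed-counter topdown normalization: each raw topdown-* count divided by the sum of all four (the trailing + 0 * slots only pulls the slots event into the same event group without changing the value). A sketch of that normalization (not part of the patch, values hypothetical):

  # Normalize the four raw topdown counts into level-1 fractions, as the
  # tma_retiring / tma_memory_bound expressions do. Values hypothetical.
  td = {"fe-bound": 8e9, "bad-spec": 2e9, "retiring": 16e9, "be-bound": 4e9}
  total = sum(td.values())
  fractions = {name: count / total for name, count in td.items()}
  print(fractions["retiring"])  # 0.533..., reported as 53.3% via ScaleUnit 100%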
+    {
+        "BriefDescription": "This metric represents rate of split store accesses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
+        "MetricName": "tma_split_stores",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
+        "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricName": "tma_sq_full",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    }
+]
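Before the new Arrow Lake event files that follow: each JSON entry's EventCode and UMask pair combines into the usual raw encoding for x86 core-PMU events (event select in bits 0-7, umask in bits 8-15), which is handy when cross-checking entries against raw counters. A hedged helper (ignoring extra fields such as cmask or inv):

  def raw_event(event_code: int, umask: int) -> str:
      # Usual x86 core-PMU raw encoding: umask in bits 8-15, event select in 0-7.
      return f"r{(umask << 8) | event_code:x}"

  print(raw_event(0x24, 0x3f))  # 'r3f24', i.e. L2_RQSTS.MISS defined below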
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/cache.json b/tools/perf/pmu-events/arch/x86/arrowlake/cache.json
new file mode 100644
index 000000000000..be90e7af4f8f
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/cache.json
@@ -0,0 +1,1471 @@
+[
+    {
+        "BriefDescription": "Counts the number of request that were not accepted into the L2Q because the L2Q is FULL.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x31",
+        "EventName": "CORE_REJECT_L2Q.ANY",
+        "PublicDescription": "Counts the number of (demand and L1 prefetchers) core requests rejected by the L2Q due to a full or nearly full condition which likely indicates back pressure from L2Q. It also counts requests that would have gone directly to the XQ, but are rejected due to a full or nearly full condition, indicating back pressure from the IDI link. The L2Q may also reject transactions from a core to insure fairness between cores, or to delay a cores dirty eviction when the address conflicts incoming external snoops. (Note that L2 prefetcher requests that are dropped are not counted by this event.)",
+        "SampleAfterValue": "1000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cache lines replaced in L0 data cache.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x51",
+        "EventName": "L1D.L0_REPLACEMENT",
+        "PublicDescription": "Counts L0 data line replacements including opportunistic replacements, and replacements that require stall-for-replace or block-for-replace.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of cycles a demand request has waited due to L1D Fill Buffer (FB) unavailability.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x49",
+        "EventName": "L1D_MISS.FB_FULL",
+        "PublicDescription": "Counts number of cycles a demand request has waited due to L1D Fill Buffer (FB) unavailability. Demand requests include cacheable/uncacheable demand load, store, lock or SW prefetch accesses.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of cycles a demand request has waited due to L1D due to lack of L2 resources.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x49",
+        "EventName": "L1D_MISS.L2_STALLS",
+        "PublicDescription": "Counts number of cycles a demand request has waited due to L1D due to lack of L2 resources. Demand requests include cacheable/uncacheable demand load, store, lock or SW prefetch accesses.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of demand requests that missed L1D cache",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x49",
+        "EventName": "L1D_MISS.LOAD",
+        "PublicDescription": "Count occurrences (rising-edge) of DCACHE_PENDING sub-event0. Impl. sends per-port binary inc-bit the occupancy increases* (at FB alloc or promotion).",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of L1D misses that are outstanding",
+        "Counter": "2",
+        "EventCode": "0x48",
+        "EventName": "L1D_PENDING.LOAD",
+        "PublicDescription": "Counts number of L1D misses that are outstanding in each cycle, that is each cycle the number of Fill Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand from the demand Hit FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles with L1D load Misses outstanding.",
+        "Counter": "2",
+        "CounterMask": "1",
+        "EventCode": "0x48",
+        "EventName": "L1D_PENDING.LOAD_CYCLES",
+        "PublicDescription": "Counts duration of L1D miss outstanding in cycles.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "L2 cache lines filling L2",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x25",
+        "EventName": "L2_LINES_IN.ALL",
+        "PublicDescription": "Counts the number of L2 cache lines filling the L2. Counting does not cover rejects.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1f",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Modified cache lines that are evicted by L2 cache when triggered by an L2 cache fill.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.NON_SILENT",
+        "PublicDescription": "Counts the number of lines that are evicted by L2 cache when triggered by an L2 cache fill. Those lines are in Modified state. Modified lines are written back to L3",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.SILENT",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cache lines that have been L2 hardware prefetched but not used by demand accesses",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x26",
+        "EventName": "L2_LINES_OUT.USELESS_HWPF",
+        "PublicDescription": "Counts the number of cache lines that have been prefetched by the L2 hardware prefetcher but not used by demand access when evicted from the L2 cache",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of demand and prefetch transactions that the External Queue (XQ) rejects due to a full or near full condition.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x30",
+        "EventName": "L2_REJECT_XQ.ANY",
+        "PublicDescription": "Counts the number of demand and prefetch transactions that the External Queue (XQ) rejects due to a full or near full condition which likely indicates back pressure from the IDI link. The XQ may reject transactions from the L2Q (non-cacheable requests), BBL (L2 misses) and WOB (L2 write-back victims).",
+        "SampleAfterValue": "1000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "All accesses to L2 cache [This event is alias to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.ALL",
+        "PublicDescription": "Counts all requests that were hit or true misses in L2 cache. True-miss excludes misses that were merged with ongoing L2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]",
+        "SampleAfterValue": "200003",
+        "UMask": "0xff",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Read requests with true-miss in L2 cache [This event is alias to L2_RQSTS.MISS]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x24",
+        "EventName": "L2_REQUEST.MISS",
+        "PublicDescription": "Counts read requests of any type with true-miss in the L2 cache. True-miss excludes L2 misses that were merged with ongoing L2 misses. [This event is alias to L2_RQSTS.MISS]",
+        "SampleAfterValue": "200003",
+        "UMask": "0x3f",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Demand Data Read access L2 cache",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x24",
+        "EventName": "L2_RQSTS.ALL_DEMAND_DATA_RD",
+        "PublicDescription": "Counts Demand Data Read requests accessing the L2 cache. These requests may hit or miss L2 cache. True-miss exclude misses that were merged with ongoing L2 misses. An access is counted once.",
+        "SampleAfterValue": "200003",
+        "UMask": "0xe1",
+        "Unit": "cpu_core"
+    },
An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0xe1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ANY", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache misses when fetching instructions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.CODE_RD_MISS", + "PublicDescription": "Counts L2 cache misses when fetching instruc= tions.", + "SampleAfterValue": "200003", + "UMask": "0x24", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests that hit L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_HIT", + "PublicDescription": "Counts the number of demand Data Read reques= ts initiated by load instructions that hit L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_MISS", + "PublicDescription": "Counts demand Data Read requests with true-m= iss in the L2 cache. True-miss excludes misses that were merged with ongoin= g L2 misses. An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_REQUEST.MISS]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.MISS", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_REQUEST.MISS]", + "SampleAfterValue": "200003", + "UMask": "0x3f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.REFERENCES", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. 
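
The demand data read hit/miss pair above lends itself to a simple derived ratio; this is not a metric defined by this patch, just a sketch of one way the two counts combine:

    # L2 demand data read hit rate from L2_RQSTS.DEMAND_DATA_RD_HIT (umask 0x41)
    # and L2_RQSTS.DEMAND_DATA_RD_MISS (umask 0x21); example values are made up.
    def l2_demand_read_hit_rate(hits, misses):
        return hits / (hits + misses)

    print(l2_demand_read_hit_rate(900_000, 100_000))  # 0.9
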
[This event is alias to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "RFO requests that miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.RFO_MISS", + "PublicDescription": "Counts the RFO (Read-for-Ownership) requests= that miss L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x22", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when L1D is locked", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x42", + "EventName": "LOCK_CYCLES.CACHE_LOCK_DURATION", + "PublicDescription": "This event counts the number of cycles when = the L1D is locked.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core-originated cacheable requests that misse= d L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.MISS", + "PublicDescription": "Counts core-originated cacheable requests th= at miss the L3 cache (Longest Latency cache). Requests include data and cod= e reads, Reads-for-Ownership (RFOs), speculative accesses and hardware pref= etches to the L1 and L2. It does not include hardware prefetches to the L3= , and may not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core-originated cacheable requests that refer= to L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts core-originated cacheable requests to= the L3 cache (Longest Latency cache). Requests include data and code reads= , Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches = to the L1 and L2. It does not include hardware prefetches to the L3, and m= ay not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x4f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cacheable memory request= s that access the LLC. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts the number of cacheable memory reques= ts that access the Last Level Cache (LLC). Requests include demand loads, r= eads for ownership (RFO), instruction fetches and L1 HW prefetches. If the = core has access to an L3 cache, the LLC is the L3 cache, otherwise it is th= e L2 cache. Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x4f", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_HIT", + "PublicDescription": "Counts the number of cycles the core is stal= led due to an instruction cache or Translation Lookaside Buffer (TLB) miss = which hit in the L2 cache. 
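
For readers wiring these entries up by hand, the EventCode/UMask fields map onto the standard raw encoding for Intel core PMUs that perf accepts; a minimal sketch:

    # perf_event_attr.config bits 0-7: event code; bits 8-15: umask;
    # bits 24-31: counter mask (cmask).
    def raw_config(event_code, umask, cmask=0):
        return event_code | (umask << 8) | (cmask << 24)

    # LONGEST_LAT_CACHE.MISS above: EventCode 0x2e, UMask 0x41.
    assert raw_config(0x2e, 0x41) == 0x412e
    # roughly equivalent to: perf stat -e cpu_core/event=0x2e,umask=0x41/
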
Includes L2 hits resulting from an L1D eviction by another core in the same module, which has longer latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_HIT", + "PublicDescription": "Counts the number of cycles the core is stalled due to an instruction cache or Translation Lookaside Buffer (TLB) miss which hit in the L2 cache.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles the core is stalled due to an instruction cache or TLB miss which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to an icache or itlb miss which hit in the LLC.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.LLC_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to an L1 demand load miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to an L1 demand load miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_HIT", + "PublicDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the L2 cache. Includes L2 hits resulting from an L1D eviction by another core in the same module, which has longer latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_HIT", + "PublicDescription": "Counts the number of cycles a core is stalled due to a demand load which hit in the L2 cache.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stalled due to a demand load which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to a demand load miss which hit in the LLC.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.LLC_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to a demand load miss which missed all the local caches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.LLC_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to a demand load miss which missed all the local caches. 
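
The MEM_BOUND_STALLS_LOAD umasks quoted above nest by design; a two-line check of the encodings as listed:

    # ALL (0x7f) = L2_HIT (0x1) | L2_MISS (0x7e); L2_MISS = LLC_HIT (0x6) | LLC_MISS (0x78).
    assert 0x6 | 0x78 == 0x7e
    assert 0x1 | 0x7e == 0x7f
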
If the core has access to an L3 cache, an LLC miss refers to an L3 cache= miss, otherwise it is an L2 cache miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.LLC_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts all retired load instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_LOADS", + "PublicDescription": "Counts Instructions with at least one archit= ecturally visible load retired.", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_STORES", + "PublicDescription": "Counts all retired store instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired software prefetch instructions.", + "Counter": "0,1,2,3", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_SWPF", + "PublicDescription": "Counts all retired software prefetch instruc= tions.", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All retired memory instructions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ANY", + "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", + "SampleAfterValue": "1000003", + "UMask": "0x87", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with locked access.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.LOCK_LOADS", + "PublicDescription": "Counts retired load instructions with locked= access.", + "SampleAfterValue": "100007", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that split across a= cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that split across = a cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_STORES", + "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", + "SampleAfterValue": "100003", + "UMask": "0x42", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that hit the STLB.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", + "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that hit the STLB.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", + "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0xa", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions 
that miss the STLB.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", + "PublicDescription": "Number of retired load instructions that start a miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x11", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that miss the STLB.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", + "PublicDescription": "Number of retired store instructions that start a miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x12", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data source was a cross-core snoop that hit and forwarded data from an on-package core cache (induced by NI$)", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", + "PublicDescription": "Counts retired load instructions whose data source was a cross-core snoop that hit and forwarded data from an on-package core cache (induced by NI$)", + "SampleAfterValue": "20011", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources were HitM responses from the shared L3; Hit-with-FWD is normally excluded.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", + "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from the shared L3; Hit-with-FWD is normally excluded.", + "SampleAfterValue": "20011", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", + "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", + "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions with at least 1 uncacheable load or lock.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd4", + "EventName": "MEM_LOAD_MISC_RETIRED.UC", + "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access (Bus Lock).", + "SampleAfterValue": "100007", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of completed demand load requests that missed the L1, but hit the FB (fill buffer), because a preceding miss to the same cacheline initiated the line to be brought into L1, but the data is not yet ready in L1.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.FB_HIT", + "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 data cache but hit a Fill Buffer (FB), due to a preceding miss to the same cache line whose data was not yet ready.", + "SampleAfterValue": "100007", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L1 cache hits as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT", + "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts retired load instructions with at least one uop that hit in the Level 1 of the L1 data cache.", + "Counter": "0,1,2,3", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT_L1", + "SampleAfterValue": "1000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L1 cache as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_MISS", + "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.", + "SampleAfterValue": "200003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L2 cache hits as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_HIT", + "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L2 cache as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_MISS", + "PublicDescription": "Counts retired load instructions that missed the L2 cache as data sources.", + "SampleAfterValue": "100021", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L3 cache hits as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_HIT", + "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.", + "SampleAfterValue": "100021", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L3 cache as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_MISS", + "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.", + "SampleAfterValue": "50021", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of load ops retired that hit the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hit the L1 data cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that miss in the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + 
"EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L1 data cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "200003", + "UMask": "0x40", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L2 cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", + "PublicDescription": "Counts the number of load ops retired that h= it in the L2 cache. Includes L2 Hit resulting from and L1D eviction of ano= ther core in the same module which is longer latency than a typical L2 hit.= ", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L2 cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "200003", + "UMask": "0x80", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L3 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L3_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1c", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of loads that hit in a writ= e combining buffer (WCB), excluding the first load that caused the WCB to a= llocate.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.WCB_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of loads that hit in a writ= e combining buffer (WCB), excluding the first load that caused the WCB to a= llocate.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.WCB_HIT", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ALL", + "SampleAfterValue": "20003", + "UMask": "0x7", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to load buffer full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + 
"EventName": "MEM_SCHEDULER_BLOCK.LD_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to a load buffer full condition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.LD_BUF", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to RSV full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.RSV", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to an RSV full condition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.RSV", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to store buffer full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ST_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to a store buffer full condition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ST_BUF", + "SampleAfterValue": "20003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "MEM_STORE_RETIRED.L2_HIT", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x44", + "EventName": "MEM_STORE_RETIRED.L2_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of load uops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x81", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x81", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of store uops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_STORES", + "SampleAfterValue": "200003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of store ops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_STORES", + "SampleAfterValue": "200003", + "UMask": "0x82", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_1024", + "MSRIndex": "0x3F6", + "MSRValue": "0x400", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": 
"MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_2048", + "MSRIndex": "0x3F6", + "MSRValue": "0x800", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", + "MSRIndex": "0x3F6", + "MSRValue": "0x20", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", + "MSRIndex": "0x3F6", + "MSRValue": "0x20", + "SampleAfterValue": "1000003", + 
"UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", + "MSRIndex": "0x3F6", + "MSRValue": "0x4", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", + "MSRIndex": "0x3F6", + "MSRValue": "0x4", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", + "MSRIndex": "0x3F6", + "MSRValue": "0x200", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", + "MSRIndex": "0x3F6", + "MSRValue": "0x200", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", + "MSRIndex": "0x3F6", + "MSRValue": "0x40", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", + "MSRIndex": "0x3F6", + "MSRValue": "0x40", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", + "MSRIndex": "0x3F6", + "MSRValue": "0x8", + "SampleAfterValue": "200003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", + "MSRIndex": "0x3F6", + "MSRValue": "0x8", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load uops retired that p= 
erformed one or more locks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load uops retired that performed one or more locks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of memory uops retired that were splits.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x43", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory uops retired that were splits.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x43", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired split load uops.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split load uops.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired split store uops.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_STORES", + "SampleAfterValue": "200003", + "UMask": "0x42", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split store uops.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_STORES", + "SampleAfterValue": "200003", + "UMask": "0x42", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of memory uops retired that missed in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of load uops retired that miss in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of store uops retired that miss in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of store uops retired (same as MEM_UOPS_RETIRED.ALL_STORES)", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", + "SampleAfterValue": "200003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of store uops retired (same as MEM_UOPS_RETIRED.ALL_STORES)", + 
"Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Retired memory uops for any access", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe5", + "EventName": "MEM_UOP_RETIRED.ANY", + "PublicDescription": "Number of retired micro-operations (uops) fo= r load or store memory accesses", + "SampleAfterValue": "1000003", + "UMask": "0xf", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches, data forwarding i= s required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches which forwarded th= e unmodified data to the requesting core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x20001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by the L3 cache where a snoop hit in another cores caches, data forw= arding is required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Any memory transaction that reached the SQ.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.ALL_REQUESTS", + "PublicDescription": "Counts memory transactions reached the super= queue including requests initiated by the core, all L3 prefetches, page wa= lks, etc..", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand and prefetch data reads", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DATA_RD", + "PublicDescription": "Counts the demand and prefetch data reads. A= ll Core Data Reads include cacheable 'Demands' and L2 prefetchers (not L3 p= refetchers). Counting also covers reads due to page walks resulted from any= request type.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cacheable and Non-Cacheable code read request= s", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", + "PublicDescription": "Counts both cacheable and Non-Cacheable code= read requests.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests sent to uncore", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_DATA_RD", + "PublicDescription": "Counts the Demand Data Read requests sent to= uncore. 
Use it in conjunction with OFFCORE_REQUESTS_OUTSTANDING to determi= ne average latency in the uncore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand RFO requests including regular RFOs, l= ocks, ItoM", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_RFO", + "PublicDescription": "Counts the demand RFO (read for ownership) r= equests including regular RFOs, locks, ItoM.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when offcore outstanding cacheable Cor= e Data Read transactions are present in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "PublicDescription": "Counts cycles when offcore outstanding cache= able Core Data Read transactions are present in the super queue. A transact= ion is considered to be in the Offcore outstanding state between L2 miss an= d transaction completion sent to requestor (SQ de-allocation). See correspo= nding Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding Code Reads tr= ansactions in the SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE= _RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 1 outstanding demand da= ta read request is pending.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA= _RD", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding demand rfo re= ads transactions in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO", + "PublicDescription": "Counts the number of offcore outstanding dem= and rfo Reads transactions in the super queue every cycle. The 'Offcore out= standing' state of the transaction lasts from the L2 miss until the sending= transaction completion to requestor (SQ deallocation). See the correspondi= ng Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore outstanding cacheable Core Data Read = transactions in SuperQueue (SQ), queue to uncore", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD", + "PublicDescription": "Counts the number of offcore outstanding cac= heable Core Data Read transactions in the super queue every cycle. A transa= ction is considered to be in the Offcore outstanding state between L2 miss = and transaction completion sent to requestor (SQ de-allocation). 
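
The "use it in conjunction" hint above is occupancy-over-arrivals arithmetic; a minimal sketch with made-up counts:

    # OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD accumulates the number of
    # outstanding demand reads each cycle, so dividing by the request count
    # gives the average demand data read latency in core cycles.
    outstanding_sum = 5_000_000  # OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD
    requests = 40_000            # OFFCORE_REQUESTS.DEMAND_DATA_RD
    print(outstanding_sum / requests)  # 125.0 cycles per demand read, on average
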
See corres= ponding Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore outstanding Code Reads transactions i= n the SuperQueue (SQ), queue to uncore, every cycle.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "For every cycle, increments by the number of = outstanding demand data read requests pending.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", + "PublicDescription": "For every cycle, increments by the number of= outstanding demand data read requests pending. Requests are considered o= utstanding from the time they miss the core's L2 cache until the transactio= n completion message is sent to the requestor.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Store Read transactions pending for off-core.= Highly correlated.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO", + "PublicDescription": "Counts the number of off-core outstanding re= ad-for-ownership (RFO) store transactions every cycle. An RFO transaction i= s considered to be in the Off-core outstanding state between L2 cache miss = and transaction completion.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts bus locks, accounts for cache line spl= it locks and UC locks.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2c", + "EventName": "SQ_MISC.BUS_LOCK", + "PublicDescription": "Counts the more expensive bus lock needed to= enforce cache coherency for certain memory accesses that need to be done a= tomically. 
Can be created by issuing an atomic instruction (via the LOCK p= refix) which causes a cache line split or accesses uncacheable memory.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of PREFETCHNTA, PREFETCHW, = PREFETCHT0, PREFETCHT1 or PREFETCHT2 instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.ANY", + "SampleAfterValue": "100003", + "UMask": "0xf", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHNTA instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.NTA", + "PublicDescription": "Counts the number of PREFETCHNTA instruction= s executed.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHW instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.PREFETCHW", + "PublicDescription": "Counts the number of PREFETCHW instructions = executed.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHT0 instructions executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.T0", + "PublicDescription": "Counts the number of PREFETCHT0 instructions= executed.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of PREFETCHT1 or PREFETCHT2 instructio= ns executed.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x40", + "EventName": "SW_PREFETCH_ACCESS.T1_T2", + "PublicDescription": "Counts the number of PREFETCHT1 or PREFETCHT= 2 instructions executed.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to an icache miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ICACHE", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to an icache miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ICACHE", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/floating-point.json b= /tools/perf/pmu-events/arch/x86/arrowlake/floating-point.json new file mode 100644 index 000000000000..23a80c526aa1 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/floating-point.json @@ -0,0 +1,532 @@ +[ + { + "BriefDescription": "Cycles when floating-point divide unit is bus= y executing divide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.FPDIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. 
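
A note on "CounterMask" as used by ARITH.FPDIV_ACTIVE above: a cmask of 1 makes the counter count cycles in which the event fires at least once, turning divider occupancy into busy cycles. In the raw encoding sketched earlier, cmask sits in config bits 24-31:

    config = 0xb0 | (0x1 << 8) | (1 << 24)  # event=0xb0, umask=0x1, cmask=1
    assert config == 0x10001b0
    # roughly equivalent to: perf stat -e cpu_core/event=0xb0,umask=0x1,cmask=1/
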
Accounts for floating-point operatio= ns only.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.FPDIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts all microcode FP assists.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.FP", + "PublicDescription": "Counts all microcode Floating Point assists.= ", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "ASSISTS.SSE_AVX_MIX", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.SSE_AVX_MIX", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 1st VEC= port (port 0). FP-arith-uops are of type ADD* / SUB* / MUL / FMA* / DPP.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V0", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 2nd VEC= port (port 1)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V1", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 3rd VEC= port (port 5)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V2", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of FP-arith-uops dispatched on 4th VEC= port", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb3", + "EventName": "FP_ARITH_DISPATCHED.V3", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. 
Refer to new event = FP_ARITH_OPS_RETIRED.4_FLOPS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS", + "SampleAfterValue": "100003", + "UMask": "0x18", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.SCALAR", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.SCALAR_SINGLE", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.SCALAR_SINGLE", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = FP_ARITH_OPS_RETIRED.VECTOR", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Deprecated": "1", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR", + "SampleAfterValue": "1000003", + "UMask": "0x3c", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_128B [This event= is alias to FP_ARITH_OPS_RETIRED.VECTOR_128B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR_128B", + "SampleAfterValue": "100003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_256B [This event= is alias to FP_ARITH_OPS_RETIRED.VECTOR_256B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_INST_RETIRED.VECTOR_256B", + "SampleAfterValue": "100003", + "UMask": "0x30", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational 128-bi= t packed double precision floating-point instructions retired; some instruc= tions will count twice as noted below. Each count represents 2 computation= operations, one for each element. Applies to SSE* and AVX* packed double = precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN= MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice = as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE", + "PublicDescription": "Number of SSE/AVX computational 128-bit pack= ed double precision floating-point instructions retired; some instructions = will count twice as noted below. Each count represents 2 computation opera= tions, one for each element. Applies to SSE* and AVX* packed double precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as the= y perform 2 calculations per element. 
The DAZ and FTZ flags in the MXCSR re= gister need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packe= d single precision floating-point instructions retired; some instructions w= ill count twice as noted below. Each count represents 4 computation operat= ions, one for each element. Applies to SSE* and AVX* packed single precisi= on floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT = DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they pe= rform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE", + "PublicDescription": "Number of SSE/AVX computational 128-bit pack= ed single precision floating-point instructions retired; some instructions = will count twice as noted below. Each count represents 4 computation opera= tions, one for each element. Applies to SSE* and AVX* packed single precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count tw= ice as they perform 2 calculations per element. The DAZ and FTZ flags in th= e MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational 256-bi= t packed double precision floating-point instructions retired; some instruc= tions will count twice as noted below. Each count represents 4 computation= operations, one for each element. Applies to SSE* and AVX* packed double = precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN= MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perf= orm 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE", + "PublicDescription": "Number of SSE/AVX computational 256-bit pack= ed double precision floating-point instructions retired; some instructions = will count twice as noted below. Each count represents 4 computation opera= tions, one for each element. Applies to SSE* and AVX* packed double precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 = calculations per element. The DAZ and FTZ flags in the MXCSR register need = to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational 256-bi= t packed single precision floating-point instructions retired; some instruc= tions will count twice as noted below. Each count represents 8 computation= operations, one for each element. Applies to SSE* and AVX* packed single = precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN= MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions co= unt twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE", + "PublicDescription": "Number of SSE/AVX computational 256-bit pack= ed single precision floating-point instructions retired; some instructions = will count twice as noted below. 
Each count represents 8 computation opera= tions, one for each element. Applies to SSE* and AVX* packed single precis= ion floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX S= QRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count tw= ice as they perform 2 calculations per element. The DAZ and FTZ flags in th= e MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of SSE/AVX computational 128-bit packe= d single and 256-bit packed double precision FP instructions retired; some = instructions will count twice as noted below. Each count represents 2 or/a= nd 4 computation operations, 1 for each element. Applies to SSE* and AVX* = packed single precision and packed double precision FP instructions: ADD SU= B HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DP= P and FM(N)ADD/SUB count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.4_FLOPS", + "PublicDescription": "Number of SSE/AVX computational 128-bit pack= ed single precision and 256-bit packed double precision floating-point ins= tructions retired; some instructions will count twice as noted below. Each= count represents 2 or/and 4 computation operations, one for each element. = Applies to SSE* and AVX* packed single precision floating-point and packed= double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL= DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB ins= tructions count twice as they perform 2 calculations per element. The DAZ a= nd FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x18", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of SSE/AVX computational scalar floati= ng-point instructions retired; some instructions will count twice as noted = below. Applies to SSE* and AVX* scalar, double and single precision floati= ng-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB= . DPP and FM(N)ADD/SUB instructions count twice as they perform multiple c= alculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR", + "PublicDescription": "Number of SSE/AVX computational scalar singl= e precision and double precision floating-point instructions retired; some = instructions will count twice as noted below. Each count represents 1 comp= utational operation. Applies to SSE* and AVX* scalar single precision float= ing-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB= . FM(N)ADD/SUB instructions count twice as they perform 2 calculations per= element. The DAZ and FTZ flags in the MXCSR register need to be set when u= sing these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational scalar= double precision floating-point instructions retired; some instructions wi= ll count twice as noted below. Each count represents 1 computational opera= tion. Applies to SSE* and AVX* scalar double precision floating-point instr= uctions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. 
FM(N)ADD/SUB instructi= ons count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE", + "PublicDescription": "Number of SSE/AVX computational scalar doubl= e precision floating-point instructions retired; some instructions will cou= nt twice as noted below. Each count represents 1 computational operation. = Applies to SSE* and AVX* scalar double precision floating-point instruction= s: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions co= unt twice as they perform 2 calculations per element. The DAZ and FTZ flags= in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of SSE/AVX computational scalar= single precision floating-point instructions retired; some instructions wi= ll count twice as noted below. Each count represents 1 computational opera= tion. Applies to SSE* and AVX* scalar single precision floating-point instr= uctions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB= instructions count twice as they perform 2 calculations per element.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_SINGLE", + "PublicDescription": "Number of SSE/AVX computational scalar singl= e precision floating-point instructions retired; some instructions will cou= nt twice as noted below. Each count represents 1 computational operation. = Applies to SSE* and AVX* scalar single precision floating-point instruction= s: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instr= uctions count twice as they perform 2 calculations per element. The DAZ and= FTZ flags in the MXCSR register need to be set when using these events.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of any Vector retired FP arithmetic in= structions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR", + "PublicDescription": "Number of any Vector retired FP arithmetic i= nstructions. 
The DAZ and FTZ flags in the MXCSR register need to be set wh= en using these events.", + "SampleAfterValue": "1000003", + "UMask": "0x3c", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_128B [This event = is alias to FP_ARITH_INST_RETIRED.VECTOR_128B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_128B", + "SampleAfterValue": "100003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_256B [This event = is alias to FP_ARITH_INST_RETIRED.VECTOR_256B]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc7", + "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_256B", + "SampleAfterValue": "100003", + "UMask": "0x30", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of all types of floating po= int operations per uop with all default weighting", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of all types of floating po= int operations per uop with all default weighting", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "This event is deprecated. [This event is alia= s to FP_FLOPS_RETIRED.FP64]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.DP", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 32 bit single precision results", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP32", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 32 bit single precision results [This event is alias to FP_F= LOPS_RETIRED.SP]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP32", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 64 bit double precision results", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP64", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s that produce 64 bit double precision results [This event is alias to FP_F= LOPS_RETIRED.DP]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.FP64", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "This event is deprecated. [This event is alia= s to FP_FLOPS_RETIRED.FP32]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xc8", + "EventName": "FP_FLOPS_RETIRED.SP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions who= se sources are a packed 128 bit double precision floating point. 
This may be SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 128 bit double precision floating point. This may be SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 128 bit single precision floating point. This may be SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 128 bit single precision floating point. This may be SSE or AVX.128 operations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.128B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 256 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 256 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a packed 256 bit single precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.256B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 32 bit single precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.32B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 32 bit single precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.32B_SP", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 64 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.64B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 64 bit double precision floating point.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.64B_DP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the total number of retired floating point instructions.", + "Counter": "0,1,2,3,4,5,6,7", +
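As a reference for how these tables are consumed: the perf tool turns each entry into a raw perf_event_attr, using the Intel core-PMU sysfs format (event=config:0-7, umask=config:8-15, edge=config:18, inv=config:23, cmask=config:24-31). Below is a minimal illustrative sketch, not part of the generated files, that counts FP_FLOPS_RETIRED.ALL (EventCode 0xc8, UMask 0x3) on the cpu_atom PMU; the hybrid PMU name and sysfs path are assumptions about the running system.

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Read a dynamic PMU type, e.g. /sys/bus/event_source/devices/cpu_atom/type. */
static int pmu_type(const char *pmu)
{
	char path[256];
	FILE *f;
	int type = -1;

	snprintf(path, sizeof(path), "/sys/bus/event_source/devices/%s/type", pmu);
	f = fopen(path, "r");
	if (f && fscanf(f, "%d", &type) != 1)
		type = -1;
	if (f)
		fclose(f);
	return type;
}

int main(void)
{
	struct perf_event_attr attr;
	uint64_t count;
	int type, fd;

	type = pmu_type("cpu_atom");	/* hybrid little-core PMU; assumption */
	if (type < 0)
		return 1;
	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	/* event=0xc8, umask=0x3 -> FP_FLOPS_RETIRED.ALL from this file */
	attr.config = 0xc8 | (0x3 << 8);
	attr.disabled = 1;
	attr.exclude_kernel = 1;

	/* Count for the current task; ticks only while it runs on an atom core. */
	fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... workload under measurement would run here ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("FP_FLOPS_RETIRED.ALL: %llu\n", (unsigned long long)count);
	close(fd);
	return 0;
}

This is the same encoding that perf stat -e cpu_atom/event=0xc8,umask=0x3/ programs underneath.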
"EventCode": "0xc7", + "EventName": "FP_INST_RETIRED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x3f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on floatin= g point and vector integer port 0, 1, 2, 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.PRIMARY", + "SampleAfterValue": "1000003", + "UMask": "0x1e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on floatin= g point and vector integer store data port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.STD", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s retired that required microcode assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.FP_ASSIST", + "PublicDescription": "Counts the number of floating point operatio= ns retired that required microcode assist, which is not a reflection of the= number of FP operations, instructions or uops.", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s retired that required microcode assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.FP_ASSIST", + "PublicDescription": "Counts the number of floating point operatio= ns retired that required microcode assist, which is not a reflection of the= number of FP operations, instructions or uops.", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of floating point divide uo= ps retired (x87 and sse, including x87 sqrt)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.FPDIV", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point divide uo= ps retired (x87 and sse, including x87 sqrt).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.FPDIV", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/frontend.json b/tools= /perf/pmu-events/arch/x86/arrowlake/frontend.json new file mode 100644 index 000000000000..fc5f4dd50fe6 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/frontend.json @@ -0,0 +1,609 @@ +[ + { + "BriefDescription": "Counts the total number of BACLEARS due to al= l branch types including conditional and unconditional jumps, returns, and = indirect branches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Counts the total number of BACLEARS, which o= ccur when the Branch Target Buffer (BTB) prediction or lack thereof, was co= rrected by a later branch predictor in the frontend. Includes BACLEARS due= to all branch types including conditional and unconditional jumps, returns= , and indirect branches.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Clears due to Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x60", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Number of times the front-end is resteered w= hen it finds a branch instruction in a fetch line. 
This is called Unknown B= ranch which occurs for the first time a branch instruction is fetched or wh= en the branch is not tracked by the BPU (Branch Prediction Unit) anymore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of BACLEARS due to al= l branch types including conditional and unconditional jumps, returns, and = indirect branches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Counts the total number of BACLEARS, which o= ccur when the Branch Target Buffer (BTB) prediction or lack thereof, was co= rrected by a later branch predictor in the frontend. Includes BACLEARS due= to all branch types including conditional and unconditional jumps, returns= , and indirect branches.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Stalls caused by changing prefix length of th= e instruction.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.LCP", + "PublicDescription": "Counts cycles that the Instruction Length de= coder (ILD) stalls occurred due to dynamically changing prefix length of th= e decoded instruction (by operand size prefix instruction 0x66, address siz= e prefix instruction 0x67 or REX.W for Intel64). Count is proportional to t= he number of prefixes in a 16B-line. This may result in a three-cycle penal= ty for each LCP (Length changing prefix) in a 16-byte chunk.", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles the Microcode Sequencer is busy.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.MS_BUSY", + "SampleAfterValue": "500009", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "DSB-to-MITE switch true penalty cycles.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x61", + "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", + "PublicDescription": "Decode Stream Buffer (DSB) is a Uop-cache th= at holds translations of previously fetched instructions that were decoded = by the legacy x86 decode pipeline (MITE). This event counts fetch penalty c= ycles when a transition occurs from DSB to MITE.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "Always Not Taken (ANT) conditional retired b= ranches (no BTB entry and not mispredicted)", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced DSB miss= .", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x1", + "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. 
the decoded instruction-cache) miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced a critic= al DSB miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x11", + "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ica= che miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ICACHE", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ITL= B miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired Instructions who experienced iTLB tru= e miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x14", + "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ITL= B miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Retired Instructions who experienced Instruct= ion L1 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L1I_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x12", + "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced Instruct= ion L2 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L2_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x13", + "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 128 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", + "MSRIndex": "0x3F7", + "MSRValue": "0x608006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + 
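For the FRONTEND_RETIRED.* entries above, the MSRIndex/MSRValue pair is not programmed by hand: Intel core PMUs expose a "frontend" sysfs format term carried in config1 (MSR_PEBS_FRONTEND, 0x3F7), and the perf tool fills it from the JSON. A hedged sketch of the attribute that results for FRONTEND_RETIRED.LATENCY_GE_128 (EventCode 0xc6, UMask 0x3, MSRValue 0x608006); the core_pmu_type parameter and the assumption that config1 carries the value unshifted come from the running kernel's format file, not from this patch.

#include <linux/perf_event.h>
#include <string.h>

/*
 * Sketch only: roughly what the perf tool builds for
 * cpu_core/event=0xc6,umask=0x3,frontend=0x608006/ on this PMU.
 */
static void frontend_latency_attr(struct perf_event_attr *attr, int core_pmu_type)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = core_pmu_type;		/* cpu_core dynamic type */
	attr->config = 0xc6 | (0x3 << 8);	/* FRONTEND_RETIRED encoding */
	attr->config1 = 0x608006;		/* MSR_PEBS_FRONTEND (0x3F7) value */
	attr->sample_period = 100007;		/* SampleAfterValue above */
	attr->precise_ip = 1;			/* these are PEBS-based events */
}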
"UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 16 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", + "MSRIndex": "0x3F7", + "MSRValue": "0x601006", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions after front-end starvati= on of at least 2 cycles", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", + "MSRIndex": "0x3F7", + "MSRValue": "0x600206", + "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 256 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", + "MSRIndex": "0x3F7", + "MSRValue": "0x610006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end had at least 1 bubble-slot for a period of 2= cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", + "MSRIndex": "0x3F7", + "MSRValue": "0x100206", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 32 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", + "MSRIndex": "0x3F7", + "MSRValue": "0x602006", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. 
During th= is period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 4 cycles w= hich was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", + "MSRIndex": "0x3F7", + "MSRValue": "0x600406", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 512 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", + "MSRIndex": "0x3F7", + "MSRValue": "0x620006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 64 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", + "MSRIndex": "0x3F7", + "MSRValue": "0x604006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 8 cycles w= hich was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", + "MSRIndex": "0x3F7", + "MSRValue": "0x600806", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. 
During thi= s period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MISP_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "ANT retired branches that got just mispredic= ted", + "SampleAfterValue": "100007", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts flows delivered by the Microcode Seque= ncer", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MS_FLOWS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced STLB (2n= d level TLB) true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.STLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x15", + "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that caused clears due t= o being Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", + "MSRIndex": "0x3F7", + "MSRValue": "0x17", + "PublicDescription": "Number retired branch instructions that caus= ed the front-end to be resteered when it finds the instruction in a fetch l= ine. This is called Unknown Branch which occurs for the first time a branch= instruction is fetched or when the branch is not tracked by the BPU (Branc= h Prediction Unit) anymore.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to Ins= truction L1 cache miss, that hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L2_HIT", + "PublicDescription": "Counts the number of instructions retired th= at were tagged because empty issue slots were seen before the uop due to In= struction L1 cache miss, that hit in the L2 cache. 
Includes L2 Hit resulting from an L1D eviction of another core in the same module, which has longer latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to Instruction L1 cache miss, that missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0xe", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to Instruction L1 cache miss, that hit in the L3 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L3_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss that hit in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired that were tagged because empty issue slots were seen before the uop due to ITLB miss that also missed the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequentially from the previous line or being redirected by a jump.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.ACCESSES", + "SampleAfterValue": "200003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequentially from the previous line or being redirected by a jump.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.ACCESSES", + "SampleAfterValue": "200003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequentially from the previous line or being redirected by a jump and the instruction cache registers bytes are present.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequentially from the previous line or being redirected by a jump and the instruction cache registers bytes are not present.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.MISSES", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts every time the code stream enters into a new cache line by walking sequentially from the previous line or being redirected by a jump and the instruction cache registers bytes are not present.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.MISSES", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles where a code fetch is stalled due to an L1 instruction cache miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALLS", + "PublicDescription": "Counts cycles where a code line fetch is stalled due to an L1 instruction cache miss. The decode pipeline works at a 32 Byte granularity.", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "ICACHE_DATA.STALL_PERIODS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALL_PERIODS", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where a code fetch is stalled due to an L1 instruction cache tag miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x83", + "EventName": "ICACHE_TAG.STALLS", + "PublicDescription": "Counts cycles where a code fetch is stalled due to an L1 instruction cache tag miss.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Decode Stream Buffer (DSB) is delivering any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delivered to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles DSB is delivering optimal number of Uops", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where the optimal number of uops was delivered to the Instruction Decode Queue (IDQ) from the DSB (Decode Stream Buffer) path. Count includes uops that may 'bypass' the IDQ.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.DSB_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delivered to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. During these cycles uops are not being delivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering optimal number of Uops", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where the optimal number of uops was delivered to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipeline) path. 
During these cycles uops are not being delivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (IDQ) from MITE path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MITE_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instruction Decode Queue (IDQ) from the MITE path. This also means that uops are not being delivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when uops are being delivered to IDQ while MS is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_CYCLES_ANY", + "PublicDescription": "Counts cycles during which uops are being delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Uops may be initiated by Decode Stream Buffer (DSB) or MITE.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of switches from DSB or MITE to the MS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_SWITCHES", + "PublicDescription": "Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer.", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops initiated by MITE or Decode Stream Buffer (DSB) and delivered to Instruction Decode Queue (IDQ) while Microcode Sequencer (MS) is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MS_UOPS", + "PublicDescription": "Counts the number of uops initiated by MITE or Decode Stream Buffer (DSB) and delivered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS) is busy. Counting includes uops that may 'bypass' the IDQ.", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts the subset of the Topdown Slots event in which no operation was delivered to the back-end pipeline due to instruction fetch limitations while the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CORE", + "PublicDescription": "This event counts the subset of the Topdown Slots event in which no operation was delivered to the back-end pipeline due to instruction fetch limitations while the back-end could have accepted more operations. Common examples include instruction cache misses or x86 instruction decode limitations. Software can use this event as the numerator for the Frontend Bound metric (or top-level category) of the Top-down Microarchitecture Analysis method.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. 
[This event is alias to IDQ_BUBBLES.STARVATION_CYCLES]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "Deprecated": "1", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when optimal number of uops was delivered to the back-end when the back-end is not stalled", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CYCLES_FE_WAS_OK", + "Invert": "1", + "PublicDescription": "Counts the number of cycles when the optimal number of uops was delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when no uops are delivered by the IDQ for 2 or more cycles when the backend of the machine is not stalled - normally indicating a Fetch Latency issue", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.FETCH_LATENCY", + "PublicDescription": "Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls for 2 or more cycles - normally indicating a Fetch Latency issue.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when no uops are delivered by the IDQ when the backend of the machine is not stalled [This event is alias to IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.STARVATION_CYCLES", + "PublicDescription": "Counts the number of cycles when no uops were delivered by the Instruction Decode Queue (IDQ) to the back-end of the pipeline when there were no back-end stalls. 
[This event is alias to IDQ_BUBBL= ES.CYCLES_0_UOPS_DELIV.CORE]", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles that the micro-se= quencer is busy.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.MS_BUSY", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/memory.json b/tools/p= erf/pmu-events/arch/x86/arrowlake/memory.json new file mode 100644 index 000000000000..08f01fc66fef --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/memory.json @@ -0,0 +1,387 @@ +[ + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to any number of reasons, incl= uding an L1 miss, WCB full, pagewalk, store address block or store data blo= ck, on a load that retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ANY_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to any number of reasons, incl= uding an L1 miss, WCB full, pagewalk, store address block or store data blo= ck, on a load that retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ANY_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a core bound stall includin= g a store address match, a DTLB miss or a page walk that detains the load f= rom retiring.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_BOUND_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xf4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a core bound stall includin= g a store address match, a DTLB miss or a page walk that detains the load f= rom retiring.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_BOUND_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xf4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a DL1 miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a DL1 = miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a DL1 = miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.L1_MISS_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to other = block cases.", + "Counter": "0,1,2,3,4,5,6,7", 
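The MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* entries further down in this file carry their latency threshold in MSRValue (MSR 0x3F6); with the perf tool this surfaces as the ldlat term in config1, for example through the mem-loads alias. A minimal sketch, under the assumption of a cpu_core PMU type supplied by the caller, of the attribute for sampling loads slower than 128 cycles:

#include <linux/perf_event.h>
#include <string.h>

/* Sketch: PEBS load-latency sampling, threshold carried in config1 (ldlat). */
static void load_latency_attr(struct perf_event_attr *attr, int core_pmu_type)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = core_pmu_type;		/* cpu_core dynamic type */
	attr->config = 0xcd | (0x1 << 8);	/* MEM_TRANS_RETIRED load latency */
	attr->config1 = 128;			/* ldlat threshold (MSR 0x3F6) */
	attr->sample_period = 1009;		/* SampleAfterValue for GT_128 */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR | PERF_SAMPLE_WEIGHT;
	attr->precise_ip = 3;			/* PEBS sampling */
}

perf mem record, or perf record -e cpu_core/mem-loads,ldlat=128/P, sets up essentially this attribute.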
+ "EventCode": "0x05", + "EventName": "LD_HEAD.OTHER_AT_RET", + "PublicDescription": "Counts the number of cycles that the head (o= ldest load) of the load buffer and retirement are both stalled due to other= block cases such as pipeline conflicts, fences, etc.", + "SampleAfterValue": "1000003", + "UMask": "0xc0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to other = block cases.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.OTHER_AT_RET", + "PublicDescription": "Counts the number of cycles that the head (o= ldest load) of the load buffer and retirement are both stalled due to other= block cases such as pipeline conflicts, fences, etc.", + "SampleAfterValue": "1000003", + "UMask": "0xc0", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a page= walk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xa0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a page= walk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xa0", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a stor= e address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a stor= e address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to reques= t buffers full or lock in progress.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.WCB_FULL_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to a snoop from an external agent. Does not count inte= rnally generated machine clears such as those due to disambiguations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears due to memory orderi= ng conflicts.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "PublicDescription": "Counts the number of Machine Clears detected= dye to memory ordering. 
Memory Ordering Machine Clears may apply when a me= mory read may not conform to the memory ordering rules of the x86 architect= ure", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of machine clears due to me= mory ordering caused by a snoop from an external agent. Does not count inte= rnally generated machine clears such as those due to memory disambiguation.= ", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 1024 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", + "MSRIndex": "0x3F6", + "MSRValue": "0x400", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 1024 cycles. Reporte= d latency may be longer than just the memory latency.", + "SampleAfterValue": "53", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 128 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", + "SampleAfterValue": "1009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 16 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", + "SampleAfterValue": "20011", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 2048 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048", + "MSRIndex": "0x3F6", + "MSRValue": "0x800", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 2048 cycles. Reporte= d latency may be longer than just the memory latency.", + "SampleAfterValue": "23", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 256 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. 
Reported= latency may be longer than just the memory latency.", + "SampleAfterValue": "503", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 32 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", + "MSRIndex": "0x3F6", + "MSRValue": "0x20", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", + "SampleAfterValue": "100007", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 4 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", + "MSRIndex": "0x3F6", + "MSRValue": "0x4", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 512 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", + "MSRIndex": "0x3F6", + "MSRValue": "0x200", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. Reported= latency may be longer than just the memory latency.", + "SampleAfterValue": "101", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 64 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", + "MSRIndex": "0x3F6", + "MSRValue": "0x40", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", + "SampleAfterValue": "2003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 8 cycles.", + "Counter": "2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", + "MSRIndex": "0x3F6", + "MSRValue": "0x8", + "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. Reported l= atency may be longer than just the memory latency.", + "SampleAfterValue": "50021", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired memory store access operations. A PDi= st event for PEBS Store Latency Facility.", + "Counter": "0,1", + "Data_LA": "1", + "EventCode": "0xcd", + "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", + "PublicDescription": "Counts Retired memory accesses with at least= 1 store operation. This PEBS event is the precisely-distributed (PDist) tr= igger covering all stores uops for sampling by the PEBS Store Latency Facil= ity. 
The facility is described in Intel SDM Volume 3 section 19.9.8", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts misaligned loads that are 4K page spli= ts.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x13", + "EventName": "MISALIGN_MEM_REF.LOAD_PAGE_SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts misaligned loads that are 4K page spli= ts.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x13", + "EventName": "MISALIGN_MEM_REF.LOAD_PAGE_SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts misaligned stores that are 4K page spl= its.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x13", + "EventName": "MISALIGN_MEM_REF.STORE_PAGE_SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts misaligned stores that are 4K page spl= its.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x13", + "EventName": "MISALIGN_MEM_REF.STORE_PAGE_SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFE7F8000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were no= t supplied by the L3 cache.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFE7F8000002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where data return is pending for a Demand Data Read request that missed the L3 cache.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_L3_MISS_DEM= AND_DATA_RD", + "PublicDescription": "Cycles with at least 1 Demand Data Read request that missed the L3 cache in the superQ.", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "For every cycle, increments by the number of = demand data read requests pending that are known to have missed the L3 cach= e.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD", + "PublicDescription": "For every cycle, increments by the number of= demand data read requests pending that are known to have missed the L3 cac= he. 
Note that this does not capture all elapsed cycles while requests are = outstanding - only cycles from when the requests were known by the requesti= ng core to have missed the L3 cache.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/metricgroups.json b/t= ools/perf/pmu-events/arch/x86/arrowlake/metricgroups.json new file mode 100644 index 000000000000..855585fe6fae --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/metricgroups.json @@ -0,0 +1,150 @@ +{ + "Backend": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Met= rics spreadsheet", + "BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Branches": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "BvBC": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvBO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMT": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvOB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvUW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "C0Wait": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "CacheMisses": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "CodeGen": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Compute": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSBmiss": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "DataSharing": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "Fed": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "FetchBW": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "FetchLat": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Flops": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "FpScalar": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "FpVector": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Frontend": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "HPC": "Grouping from Top-down 
Microarchitecture Analysis Metrics spre= adsheet", + "IcMiss": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Ifetch": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "IntVector": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis = Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", + "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Mem_Exec": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "MemoryTLB": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_BW": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_Lat": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MicroSeq": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "OS": "Grouping from Top-down Microarchitecture Analysis Metrics sprea= dsheet", + "Offcore": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "PGO": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Server": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Snoop": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "SoC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Summary": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "TmaL1": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL2": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL3mem": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "TopdownL1": "Metrics for top-down breakdown at level 1", + "TopdownL2": "Metrics for top-down breakdown at level 2", + "TopdownL3": "Metrics for top-down breakdown at level 3", + "TopdownL4": "Metrics for top-down breakdown at level 4", + 
"TopdownL5": "Metrics for top-down breakdown at level 5", + "TopdownL6": "Metrics for top-down breakdown at level 6", + "load_store_bound": "Grouping from Top-down Microarchitecture Analysis= Metrics spreadsheet", + "tma_L1_group": "Metrics for top-down breakdown at level 1", + "tma_L2_group": "Metrics for top-down breakdown at level 2", + "tma_L3_group": "Metrics for top-down breakdown at level 3", + "tma_L4_group": "Metrics for top-down breakdown at level 4", + "tma_L5_group": "Metrics for top-down breakdown at level 5", + "tma_L6_group": "Metrics for top-down breakdown at level 6", + "tma_alu_op_utilization_group": "Metrics contributing to tma_alu_op_ut= ilization category", + "tma_assists_group": "Metrics contributing to tma_assists category", + "tma_backend_bound_group": "Metrics contributing to tma_backend_bound = category", + "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", + "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", + "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", + "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", + "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", + "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", + "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", + "tma_fetch_bandwidth_group": "Metrics contributing to tma_fetch_bandwi= dth category", + "tma_fetch_latency_group": "Metrics contributing to tma_fetch_latency = category", + "tma_fp_arith_group": "Metrics contributing to tma_fp_arith category", + "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", + "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", + "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", + "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_band= width category", + "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latenc= y category", + "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", + "tma_issue2P": "Metrics related by the issue $issue2P", + "tma_issueBM": "Metrics related by the issue $issueBM", + "tma_issueBW": "Metrics related by the issue $issueBW", + "tma_issueComp": "Metrics related by the issue $issueComp", + "tma_issueD0": "Metrics related by the issue $issueD0", + "tma_issueFB": "Metrics related by the issue $issueFB", + "tma_issueFL": "Metrics related by the issue $issueFL", + "tma_issueL1": "Metrics related by the issue $issueL1", + "tma_issueLat": "Metrics related by the issue $issueLat", + "tma_issueMC": "Metrics related by the issue $issueMC", + "tma_issueMS": "Metrics related by the issue $issueMS", + "tma_issueMV": "Metrics related by the issue $issueMV", + "tma_issueRFO": "Metrics related by the issue $issueRFO", + "tma_issueSL": "Metrics related by the issue $issueSL", + "tma_issueSO": "Metrics related by the issue $issueSO", + "tma_issueSmSt": "Metrics related by the issue $issueSmSt", + "tma_issueSpSt": "Metrics related by the issue $issueSpSt", + "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", + "tma_issueTLB": "Metrics related by the 
issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", + "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", + "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", + "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", + "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", + "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", + "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", + "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", + "tma_microcode_sequencer_group": "Metrics contributing to tma_microcod= e_sequencer category", + "tma_mite_group": "Metrics contributing to tma_mite category", + "tma_other_light_ops_group": "Metrics contributing to tma_other_light_= ops category", + "tma_ports_utilization_group": "Metrics contributing to tma_ports_util= ization category", + "tma_ports_utilized_0_group": "Metrics contributing to tma_ports_utili= zed_0 category", + "tma_ports_utilized_3m_group": "Metrics contributing to tma_ports_util= ized_3m category", + "tma_resource_bound_group": "Metrics contributing to tma_resource_boun= d category", + "tma_retiring_group": "Metrics contributing to tma_retiring category", + "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", + "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" +} diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/other.json b/tools/pe= rf/pmu-events/arch/x86/arrowlake/other.json new file mode 100644 index 000000000000..0175b2193201 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/other.json @@ -0,0 +1,279 @@ +[ + { + "BriefDescription": "Count all other hardware assists or traps tha= t are not necessarily architecturally exposed (through a software handler) = beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-eve= nts.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.HARDWARE", + "PublicDescription": "Count all other hardware assists or traps th= at are not necessarily architecturally exposed (through a software handler)= beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-ev= ents. 
This includes, but is not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist), assists generated by ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "ASSISTS.PAGE_FAULT", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.PAGE_FAULT", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where the pipeline is stalled d= ue to serializing operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa2", + "EventName": "BE_STALLS.SCOREBOARD", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of times a load depends on another load that wrote back its data in the same cycle or in the previous 2 cycles. This event supports indirect dependency through a single uop.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x02", + "EventName": "DEPENDENT_LOADS.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of uops executed on seconda= ry integer ports 0,1,2,3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.2ND", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a load = port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.LD", + "PublicDescription": "Counts the number of uops executed on a load= port. This event counts for integer uops even if the destination is FP/vector.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer ports 0, 1, 2, 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.PRIMARY", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a Store= address port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STA", + "PublicDescription": "Counts the number of uops executed on a Stor= e address port. This event counts integer uops even if the data source is FP/vector.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on an inte= ger store data and jump port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STD_JMP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. 
[This event is alia= s to MISC_RETIRED.LBR_INSERTS]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xe4", + "EventName": "LBR_INSERTS.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L1 cache (that is: no execution & load in flight = & no load missed L1 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L1", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L2 cache (that is: no execution & load in flight = & load missed L1 & no load missed L2 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L2", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L3 cache (that is: no execution & load in flight = & load missed L1 & load missed L2 cache & no load missed L3 Cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L3", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for Memory (that is: no execution & load in flight & = a load missed L3 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.MEM", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1E780000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts streaming stores which modify a full 6= 4 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores which modify only par= t of a 64 byte cacheline that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores that have any 
type of= response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10800", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores that have any type of= response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10800", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. branch mispredi= ctions or i-cache misses)", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts end of periods where the Reservation S= tation (RS) was empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_COUNT", + "Invert": "1", + "PublicDescription": "Counts end of periods where the Reservation = Station (RS) was empty. Could be useful to closely sample on front-end late= ncy issues (see the FRONTEND_RETIRED event of designated precise events)", + "SampleAfterValue": "100003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when RS was empty and a resource alloc= ation stall is asserted", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY_RESOURCE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots in a UMWAIT = or TPAUSE instruction where no uop issues due to the instruction putting th= e CPU into the C0.1 activity state.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.C01_MS_SCB", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots where no uop= could issue due to an IQ scoreboard that stalls allocation until a specifi= ed older uop retires or (in the case of jump scoreboard) executes. 
Commonly= executed instructions with IQ scoreboards include LFENCE and MFENCE.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.IQ_JEU_SCB", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles the uncore cannot take further request= s", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x2d", + "EventName": "XQ.FULL", + "PublicDescription": "number of cycles when the thread is active a= nd the uncore cannot take any further requests (for example prefetches, loa= ds or stores initiated by the Core that miss the L2 cache).", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + } +] diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json b/tools= /perf/pmu-events/arch/x86/arrowlake/pipeline.json new file mode 100644 index 000000000000..571ce2d8e4cb --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/arrowlake/pipeline.json @@ -0,0 +1,2307 @@ +[ + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles when divide unit is busy executing div= ide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.DIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. Accounts for integer and floating-po= int operations.", + "SampleAfterValue": "1000003", + "UMask": "0x9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles when integer divide unit is busy execu= ting divide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.IDIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. Accounts for integer operations only= .", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of occurrences where a microcode assis= t is invoked by hardware.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.ANY", + "PublicDescription": "Counts the number of occurrences where a mic= rocode assist is invoked by hardware. Examples include AD (page Access Dirt= y), FP and AVX related assists.", + "SampleAfterValue": "100003", + "UMask": "0x1f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of branch instruction= s retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of instructions in w= hich the instruction pointer (IP) of the processor is resteered due to a br= anch instruction and the branch instruction successfully retires. 
All bran= ch type instructions are accounted for.", + "SampleAfterValue": "200003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "All branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts all branch instructions retired.", + "SampleAfterValue": "400009", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of branch instruction= s retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of instructions in w= hich the instruction pointer (IP) of the processor is resteered due to a br= anch instruction and the branch instruction successfully retires. All bran= ch type instructions are accounted for.", + "SampleAfterValue": "200003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts retired JCC (Jump on Conditional Code)= branch instructions retired includes both taken and not taken branches", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND", + "SampleAfterValue": "200003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Conditional branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND", + "PublicDescription": "Counts conditional branch instructions retir= ed.", + "SampleAfterValue": "400009", + "UMask": "0x111", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired JCC (Jump on Con= ditional Code) branch instructions retired, includes both taken and not tak= en branches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND", + "SampleAfterValue": "200003", + "UMask": "0x7e", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Not taken branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND_NTAKEN", + "PublicDescription": "Counts not taken branch instructions retired= .", + "SampleAfterValue": "400009", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of taken JCC branch instruc= tions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xfe", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken conditional branch instructions retired= .", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND_TAKEN", + "PublicDescription": "Counts taken conditional branch instructions= retired.", + "SampleAfterValue": "400009", + "UMask": "0x101", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of taken JCC (Jump on Condi= tional Code) branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xfe", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Taken backward conditional branch instruction= s retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND_TAKEN_BWD", + "PublicDescription": "Counts taken backward conditional branch ins= tructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Taken forward 
conditional branch instructions= retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.COND_TAKEN_FWD", + "PublicDescription": "Counts taken forward conditional branch inst= ructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x102", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of far branch instructions = retired, includes far jump, far call and return, and Interrupt call and ret= urn", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.FAR_BRANCH", + "SampleAfterValue": "200003", + "UMask": "0xbf", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Far branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.FAR_BRANCH", + "PublicDescription": "Counts far branch instructions retired.", + "SampleAfterValue": "100007", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of far branch instructions = retired, includes far jump, far call and return, and interrupt call and ret= urn.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.FAR_BRANCH", + "SampleAfterValue": "200003", + "UMask": "0xbf", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near indirect JMP and ne= ar indirect CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Indirect near branch instructions retired (ex= cluding returns)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. 
TSX abort is an indirect branch.", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near indirect JMP and ne= ar indirect CALL branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near indirect CALL branc= h instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of near indirect CALL branc= h instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near indirect JMP branch= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near CALL branch instruc= tions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "SampleAfterValue": "200003", + "UMask": "0xf9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Direct and indirect near call instructions re= tired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "PublicDescription": "Counts both direct and indirect near call in= structions retired.", + "SampleAfterValue": "100007", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near CALL branch instruc= tions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "SampleAfterValue": "200003", + "UMask": "0xf9", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near RET branch instruct= ions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Return instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "PublicDescription": "Counts return instructions retired.", + "SampleAfterValue": "100007", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near RET branch instruct= ions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Taken branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "PublicDescription": "Counts taken branch instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near taken branch instru= ctions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xc0", + "Unit": 
"cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near relative CALL branc= h instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of near relative CALL branc= h instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfd", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of near relative JMP branch= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.REL_JMP", + "SampleAfterValue": "200003", + "UMask": "0xdf", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", + "SampleAfterValue": "200003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "All mispredicted branch instructions retired.= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", + "SampleAfterValue": "400009", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. 
A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", + "SampleAfterValue": "200003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "All mispredicted branch instructions retired.= This precise event may be used to get the misprediction cost via the Retir= e_Latency field of PEBS. It fires on the instruction that immediately follo= ws the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES_COST", + "SampleAfterValue": "400009", + "UMask": "0x44", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted JCC branch = instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "SampleAfterValue": "200003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted conditional branch instructions = retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", + "SampleAfterValue": "400009", + "UMask": "0x111", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted JCC (Jump o= n Conditional Code) branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "SampleAfterValue": "200003", + "UMask": "0x7e", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted conditional branch instructions = retired. This precise event may be used to get the misprediction cost via t= he Retire_Latency field of PEBS. It fires on the instruction that immediate= ly follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_COST", + "SampleAfterValue": "400009", + "UMask": "0x151", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted non-taken conditional branch ins= tructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_NTAKEN", + "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", + "SampleAfterValue": "400009", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted non-taken conditional branch ins= tructions retired. This precise event may be used to get the misprediction = cost via the Retire_Latency field of PEBS. 
It fires on the instruction that= immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x50", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted taken JCC b= ranch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xfe", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x101", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted taken JCC (= Jump on Conditional Code) branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xfe", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken backward.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD", + "PublicDescription": "Counts taken backward conditional mispredict= ed branch instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken backward. This precise event may be used to get t= he misprediction cost via the Retire_Latency field of PEBS. It fires on the= instruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST", + "SampleAfterValue": "400009", + "UMask": "0x8001", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted taken conditional branch instruc= tions retired. This precise event may be used to get the misprediction cost= via the Retire_Latency field of PEBS. It fires on the instruction that imm= ediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x141", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken forward.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD", + "PublicDescription": "Counts taken forward conditional mispredicte= d branch instructions retired.", + "SampleAfterValue": "400009", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken forward. This precise event may be used to get th= e misprediction cost via the Retire_Latency field of PEBS. 
It fires on the = instruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST", + "SampleAfterValue": "400009", + "UMask": "0x8002", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP and near indirect CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Miss-predicted near indirect branch instructi= ons retired (excluding returns)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. TSX abort is an indirect branch.", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP and near indirect CALL branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted indirect CALL retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", + "SampleAfterValue": "400009", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct CALL branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted indirect CALL retired. This prec= ise event may be used to get the misprediction cost via the Retire_Latency = field of PEBS. It fires on the instruction that immediately follows the mis= predicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST", + "SampleAfterValue": "400009", + "UMask": "0x42", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted near indirect branch instruction= s retired (excluding returns). This precise event may be used to get the mi= sprediction cost via the Retire_Latency field of PEBS. 
It fires on the inst= ruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_COST", + "SampleAfterValue": "100003", + "UMask": "0xc0", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Number of near branch instructions retired th= at were mispredicted and taken.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", + "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near taken = branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0x80", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted taken near branch instructions r= etired. This precise event may be used to get the misprediction cost via th= e Retire_Latency field of PEBS. It fires on the instruction that immediatel= y follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x60", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts the number of mispredicted = ret instructions retired. Non PEBS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET", + "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", + "SampleAfterValue": "100007", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near RET br= anch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of mispredicted near RET br= anch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Mispredicted ret instructions retired. This p= recise event may be used to get the misprediction cost via the Retire_Laten= cy field of PEBS. It fires on the instruction that immediately follows the = mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET_COST", + "SampleAfterValue": "100007", + "UMask": "0x48", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core clocks when the thread is in the C0.1 li= ght-weight slower wakeup time but more power saving optimized state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.C01", + "PublicDescription": "Counts core clocks when the thread is in the= C0.1 light-weight slower wakeup time but more power saving optimized state= . 
+    {
+        "BriefDescription": "Core clocks when the thread is in the C0.1 light-weight slower wakeup time but more power saving optimized state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.C01",
+        "PublicDescription": "Counts core clocks when the thread is in the C0.1 light-weight slower wakeup time but more power saving optimized state. This state can be entered via the TPAUSE or UMWAIT instructions.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Core clocks when the thread is in the C0.2 light-weight faster wakeup time but less power saving optimized state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.C02",
+        "PublicDescription": "Counts core clocks when the thread is in the C0.2 light-weight faster wakeup time but less power saving optimized state. This state can be entered via the TPAUSE or UMWAIT instructions.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Core clocks when the thread is in the C0.1 or C0.2 or running a PAUSE in C0 ACPI state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.C0_WAIT",
+        "PublicDescription": "Counts core clocks when the thread is in the C0.1 or C0.2 power saving optimized states (TPAUSE or UMWAIT instructions) or running the PAUSE instruction.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x70",
+        "Unit": "cpu_core"
+    },
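The C0.1/C0.2 residency counted by the three events above is entered from user space with the WAITPKG instructions the descriptions mention. A hedged sketch using the GCC/Clang intrinsics (requires -mwaitpkg and TPAUSE support in the CPU; the deadline value is purely illustrative):

  #include <immintrin.h>
  #include <x86intrin.h>

  /* Enter one of the C0.x optimized wait states counted by
   * CPU_CLK_UNHALTED.C01/C02. Bit 0 of ctrl selects between the C0.1
   * and C0.2 optimized states; see the SDM for the exact mapping. */
  static inline void short_optimized_pause(unsigned int ctrl)
  {
          /* Wait until the TSC passes now + ~10000 cycles, or until
           * the implementation-defined TPAUSE limit expires. */
          unsigned long long deadline = __rdtsc() + 10000;

          _tpause(ctrl, deadline);
  }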
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.CORE",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Core cycles when the core is not in a halt state.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.CORE",
+        "PublicDescription": "Counts the number of core cycles while the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a changing ratio with regards to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state. It is counted on a dedicated fixed counter, leaving the programmable counters available for other events.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.CORE",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted core clock cycles [This event is alias to CPU_CLK_UNHALTED.THREAD_P]",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.CORE_P",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Thread cycles when thread is not in halt state [This event is alias to CPU_CLK_UNHALTED.THREAD_P]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.CORE_P",
+        "PublicDescription": "This is an architectural event that counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. For this reason, this event may have a changing ratio with regards to wall clock time. [This event is alias to CPU_CLK_UNHALTED.THREAD_P]",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted core clock cycles [This event is alias to CPU_CLK_UNHALTED.THREAD_P]",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.CORE_P",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Core clocks when a PAUSE is pending.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.PAUSE",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x40",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of Pause instructions",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EdgeDetect": "1",
+        "EventCode": "0xec",
+        "EventName": "CPU_CLK_UNHALTED.PAUSE_INST",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x40",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted reference clock cycles.",
+        "Counter": "Fixed counter 2",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x3",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Reference cycles when the core is not in halt state.",
+        "Counter": "Fixed counter 2",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC",
+        "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. Note: On all current platforms this event stops counting during 'throttling (TM)' states' duty-off periods, when the processor is 'halted'. The counter update is done at a lower clock rate than the core clock, so the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed, software clears the overflow status bit and resets the counter to less than MAX. The reset value is not clocked into the counter immediately, so the overflow status bit will flip 'high' (1) and generate another PMI (if enabled), after which the reset value gets clocked into the counter. Therefore, software will get the interrupt and read the overflow status bit as '1' (bit 34) while the counter value is less than MAX. Software should ignore this case.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted reference clock cycles",
+        "Counter": "Fixed counter 2",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x3",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted reference clock cycles",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC_P",
+        "PublicDescription": "Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is not affected by core frequency changes and increments at a fixed frequency that is also used for the Time Stamp Counter (TSC). This event uses a programmable general purpose performance counter.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Reference cycles when the core is not in halt state.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC_P",
+        "PublicDescription": "Counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. Note: On all current platforms this event stops counting during 'throttling (TM)' states' duty-off periods, when the processor is 'halted'. The counter update is done at a lower clock rate than the core clock, so the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed, software clears the overflow status bit and resets the counter to less than MAX. The reset value is not clocked into the counter immediately, so the overflow status bit will flip 'high' (1) and generate another PMI (if enabled), after which the reset value gets clocked into the counter. Therefore, software will get the interrupt and read the overflow status bit as '1' (bit 34) while the counter value is less than MAX. Software should ignore this case.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted reference clock cycles at TSC frequency.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.REF_TSC_P",
+        "PublicDescription": "Counts the number of reference cycles that the core is not in a halt state. The core enters the halt state when it is running the HLT instruction. This event is not affected by core frequency changes and increments at a fixed frequency that is also used for the Time Stamp Counter (TSC). This event uses a programmable general purpose performance counter.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_lowpower"
+    },
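Taken together, the core-clock and reference-clock counters allow the classic average-frequency calculation: CPU_CLK_UNHALTED.CORE ticks at the actual (varying) core clock while CPU_CLK_UNHALTED.REF_TSC ticks at the constant TSC rate, so their ratio times the TSC frequency gives the average unhalted core frequency. A sketch of the arithmetic, assuming both counts cover the same interval and tsc_hz is known (e.g. from calibration):

  /* Average unhalted core frequency over an interval, from the two
   * fixed-counter events above: core_cycles from CPU_CLK_UNHALTED.CORE
   * (actual core clock) and ref_cycles from CPU_CLK_UNHALTED.REF_TSC
   * (constant TSC rate, unhalted only). */
  static double avg_core_hz(unsigned long long core_cycles,
                            unsigned long long ref_cycles,
                            double tsc_hz)
  {
          /* ref_cycles / tsc_hz approximates unhalted seconds. */
          return tsc_hz * (double)core_cycles / (double)ref_cycles;
  }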
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.THREAD",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Core cycles when the thread is not in a halt state.",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.THREAD",
+        "PublicDescription": "Counts the number of core cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. This event is a component in many key event ratios. The core frequency may change from time to time due to transitions associated with Enhanced Intel SpeedStep Technology or TM2. For this reason this event may have a changing ratio with regards to time. When the core frequency is constant, this event can approximate elapsed time while the core was not in the halt state. It is counted on a dedicated fixed counter, leaving the programmable counters available for other events.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of unhalted core clock cycles",
+        "Counter": "Fixed counter 1",
+        "EventName": "CPU_CLK_UNHALTED.THREAD",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted core clock cycles [This event is alias to CPU_CLK_UNHALTED.CORE_P]",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.THREAD_P",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Thread cycles when thread is not in halt state [This event is alias to CPU_CLK_UNHALTED.CORE_P]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.THREAD_P",
+        "PublicDescription": "This is an architectural event that counts the number of thread cycles while the thread is not in a halt state. The thread enters the halt state when it is running the HLT instruction. The core frequency may change from time to time due to power or thermal throttling. For this reason, this event may have a changing ratio with regards to wall clock time. [This event is alias to CPU_CLK_UNHALTED.CORE_P]",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted core clock cycles [This event is alias to CPU_CLK_UNHALTED.CORE_P]",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x3c",
+        "EventName": "CPU_CLK_UNHALTED.THREAD_P",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Cycles while memory subsystem has an outstanding load.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "16",
+        "EventCode": "0xa3",
+        "EventName": "CYCLE_ACTIVITY.CYCLES_MEM_ANY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total execution stalls.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "4",
+        "EventCode": "0xa3",
+        "EventName": "CYCLE_ACTIVITY.STALLS_TOTAL",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles total of 1 uop is executed on all ports and Reservation Station was not empty.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.1_PORTS_UTIL",
+        "PublicDescription": "Counts cycles during which a total of 1 uop was executed on all ports and Reservation Station (RS) was not empty.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles total of 2 or 3 uops are executed on all ports and Reservation Station (RS) was not empty.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.2_3_PORTS_UTIL",
+        "SampleAfterValue": "2000003",
+        "UMask": "0xc",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles total of 2 uops are executed on all ports and Reservation Station was not empty.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.2_PORTS_UTIL",
+        "PublicDescription": "Counts cycles during which a total of 2 uops were executed on all ports and Reservation Station (RS) was not empty.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles total of 3 uops are executed on all ports and Reservation Station was not empty.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.3_PORTS_UTIL",
+        "PublicDescription": "Cycles total of 3 uops are executed on all ports and Reservation Station (RS) was not empty.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles total of 4 uops are executed on all ports and Reservation Station was not empty.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.4_PORTS_UTIL",
+        "PublicDescription": "Cycles total of 4 uops are executed on all ports and Reservation Station (RS) was not empty.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Execution stalls while memory subsystem has an outstanding load.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "5",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.BOUND_ON_LOADS",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x21",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles where the Store Buffer was full and no loads caused an execution stall.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "2",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.BOUND_ON_STORES",
+        "PublicDescription": "Counts cycles where the Store Buffer was full and no loads caused an execution stall.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x40",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles no uop executed while RS was not empty, the SB was not full and there was no outstanding load.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa6",
+        "EventName": "EXE_ACTIVITY.EXE_BOUND_0_PORTS",
+        "PublicDescription": "Number of cycles total of 0 uops executed on all ports, Reservation Station (RS) was not empty, the Store Buffer (SB) was not full and there was no outstanding load.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x80",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Instruction decoders utilized in a cycle",
+        "Counter": "2",
+        "EventCode": "0x75",
+        "EventName": "INST_DECODED.DECODERS",
+        "PublicDescription": "Number of decoders utilized in a cycle when the MITE (legacy decode pipeline) fetches instructions.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of instructions retired.",
+        "Counter": "Fixed counter 0",
+        "EventName": "INST_RETIRED.ANY",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event",
+        "Counter": "Fixed counter 0",
+        "EventName": "INST_RETIRED.ANY",
+        "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fixed Counter: Counts the number of instructions retired",
+        "Counter": "Fixed counter 0",
+        "EventName": "INST_RETIRED.ANY",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc0",
+        "EventName": "INST_RETIRED.ANY_P",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of instructions retired. General Counter - architectural event",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc0",
+        "EventName": "INST_RETIRED.ANY_P",
+        "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc0",
+        "EventName": "INST_RETIRED.ANY_P",
+        "SampleAfterValue": "2000003",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "retired macro-fused uops when there is a branch in the macro-fused pair (the two instructions that got macro-fused count once in this pmon)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc0",
+        "EventName": "INST_RETIRED.BR_FUSED",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "INST_RETIRED.MACRO_FUSED",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc0",
+        "EventName": "INST_RETIRED.MACRO_FUSED",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x30",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Retired NOP instructions.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc0",
+        "EventName": "INST_RETIRED.NOP",
+        "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREFETCHIT0/1 instructions",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Precise instruction retired with PEBS precise-distribution",
+        "Counter": "Fixed counter 0",
+        "EventName": "INST_RETIRED.PREC_DIST",
+        "PublicDescription": "A version of INST_RETIRED that allows for a precise distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR++) feature to fix bias in how retired instructions get sampled. Use on Fixed Counter 0.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Iterations of Repeat string retired instructions.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc0",
+        "EventName": "INST_RETIRED.REP_ITERATION",
+        "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
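To make INST_RETIRED.REP_ITERATION concrete: a REP-prefixed string instruction retires as one instruction (one INST_RETIRED.ANY count) but iterates RCX times, and this event counts the iterations, subject to the implementation-dependence noted above. A small illustrative sketch in GCC-style inline assembly:

  #include <stddef.h>

  /* One retired instruction (REP MOVSB), roughly n iterations as
   * observed by INST_RETIRED.REP_ITERATION (the exact count is
   * implementation-dependent, per the description above). */
  static void rep_movsb(void *dst, const void *src, size_t n)
  {
          asm volatile("rep movsb"
                       : "+D"(dst), "+S"(src), "+c"(n)
                       : : "memory");
  }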
"INT_VEC_RETIRED.ADD_256", + "PublicDescription": "Number of retired integer ADD/SUB (regular o= r horizontal), SAD 256-bit vector instructions.", + "SampleAfterValue": "1000003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.MUL_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.MUL_256", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.SHUFFLES", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.SHUFFLES", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_128", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_128", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_256", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because it initially appears to be store forward blocked, but subseq= uently is shown not to be blocked based on 4K alias check.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "False dependencies in MOB due to partial comp= are on address.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "PublicDescription": "Counts the number of times a load got blocke= d due to false dependencies in MOB due to partial compare on address.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because it initially appears to be store forward blocked, but subseq= uently is shown not to be blocked based on 4K alias check.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address exactly matches an older store whose da= ta is not ready (a.k.a. unknown). 
unready_fwd", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DATA_UNKNOWN", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because its address exactly matches an older store whose data is not= ready.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DATA_UNKNOWN", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "The number of times that split load operation= s are temporarily blocked because all resources for handling the split acce= sses are in use.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.NO_SR", + "PublicDescription": "Counts the number of times that split load o= perations are temporarily blocked because all resources for handling the sp= lit accesses are in use.", + "SampleAfterValue": "100003", + "UMask": "0x88", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address partially overlaps with an older store = (size mismatch) - unknown_sta/bad_forward", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Loads blocked due to overlapping with a prece= ding store that cannot be forwarded.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "PublicDescription": "Counts the number of times where store forwa= rding was prevented for a load operation. The most common case is a load bl= ocked due to the address of memory access (partially) overlapping with a pr= eceding uncompleted store. 
Note: See the table of not supported store forwa= rds in the Optimization Guide.", + "SampleAfterValue": "100003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because its address partially overlapped with an older store.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles Uops delivered by the LSD, but didn't = come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_ACTIVE", + "PublicDescription": "Counts the cycles when at least one uop is d= elivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles optimal number of Uops delivered by th= e LSD, but did not come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_OK", + "PublicDescription": "Counts the cycles when optimal number of uop= s is delivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops delivered by the LSD.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa8", + "EventName": "LSD.UOPS", + "PublicDescription": "Counts the number of uops delivered to the b= ack-end by the LSD(Loop Stream Detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts all machine clears for any reason incl= uding, but not limited to memory ordering, SMC, and FP assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.ANY", + "SampleAfterValue": "20003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears (nukes) of any type.= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.COUNT", + "PublicDescription": "Counts the number of machine clears (nukes) = of any type.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to an internal load passing an older store within the = same CPU.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION", + "SampleAfterValue": "20003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears due to me= mory ordering in which an internal load passes an older store within the sa= me CPU.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION", + "SampleAfterValue": "20003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of nukes due to memory rena= ming", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MRN_NUKE", + "SampleAfterValue": "20003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times that the machine c= lears due to a page fault. Covers both I-Side and D-Side (Loads/Stores) pa= ge faults. 
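The blocked-store-forward case that the three LD_BLOCKS.STORE_FORWARD entries describe is easy to reproduce: a narrow store followed by a wider, overlapping load cannot be satisfied from the store buffer and must wait for the store to drain. A minimal sketch of one classic pattern (the Optimization Guide table referenced above lists the full set of unsupported forwards):

  #include <stdint.h>

  /* A 1-byte store immediately followed by a 4-byte load from the same
   * address: the load partially overlaps the in-flight store, store-to-
   * load forwarding fails, and the load stalls until the store drains
   * (counted by LD_BLOCKS.STORE_FORWARD). */
  static uint32_t blocked_forward(void)
  {
          uint32_t x = 0;

          *(volatile uint8_t *)&x = 0xff;  /* narrow store */
          return *(volatile uint32_t *)&x; /* wider, overlapping load */
  }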
+    {
+        "BriefDescription": "Cycles Uops delivered by the LSD, but didn't come from the decoder.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0xa8",
+        "EventName": "LSD.CYCLES_ACTIVE",
+        "PublicDescription": "Counts the cycles when at least one uop is delivered by the LSD (Loop-stream detector).",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles optimal number of Uops delivered by the LSD, but did not come from the decoder.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "8",
+        "EventCode": "0xa8",
+        "EventName": "LSD.CYCLES_OK",
+        "PublicDescription": "Counts the cycles when optimal number of uops is delivered by the LSD (Loop-stream detector).",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of Uops delivered by the LSD.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa8",
+        "EventName": "LSD.UOPS",
+        "PublicDescription": "Counts the number of uops delivered to the back-end by the LSD (Loop Stream Detector).",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts all machine clears for any reason including, but not limited to memory ordering, SMC, and FP assist.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.ANY",
+        "SampleAfterValue": "20003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of machine clears (nukes) of any type.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EdgeDetect": "1",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.COUNT",
+        "PublicDescription": "Counts the number of machine clears (nukes) of any type.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of memory ordering machine clears triggered due to an internal load passing an older store within the same CPU.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.DISAMBIGUATION",
+        "SampleAfterValue": "20003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears due to memory ordering in which an internal load passes an older store within the same CPU.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.DISAMBIGUATION",
+        "SampleAfterValue": "20003",
+        "UMask": "0x8",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of nukes due to memory renaming",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.MRN_NUKE",
+        "SampleAfterValue": "20003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of times that the machine clears due to a page fault. Covers both I-Side and D-Side (Loads/Stores) page faults. A page fault occurs when either the page is not present, or an access violation occurs.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.PAGE_FAULT",
+        "SampleAfterValue": "20003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears due to a page fault. Counts both I-Side and D-Side (Loads/Stores) page faults. A page fault occurs when either the page is not present, or an access violation occurs.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.PAGE_FAULT",
+        "SampleAfterValue": "20003",
+        "UMask": "0x20",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears that flush the pipeline and restart the machine with the use of microcode due to SMC, MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_TRAP.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.SLOW",
+        "SampleAfterValue": "20003",
+        "UMask": "0x6e",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears that flush the pipeline and restart the machine with the use of microcode due to SMC, MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_TRAP.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.SLOW",
+        "SampleAfterValue": "20003",
+        "UMask": "0x6f",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Self-modifying code (SMC) detected.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.SMC",
+        "PublicDescription": "Counts self-modifying code (SMC) detected, which causes a machine clear.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of machine clears due to program modifying data (self modifying code) within 1K of a recently fetched code page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.SMC",
+        "SampleAfterValue": "20003",
+        "UMask": "0x1",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "LFENCE instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xe0",
+        "EventName": "MISC2_RETIRED.LFENCE",
+        "PublicDescription": "Number of LFENCE retired instructions",
+        "SampleAfterValue": "400009",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "LBR record is inserted",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xe4",
+        "EventName": "MISC_RETIRED.LBR_INSERTS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of Last Branch Record (LBR) entries. Requires LBRs to be enabled and configured in IA32_LBR_CTL. [This event is alias to LBR_INSERTS.ANY]",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xe4",
+        "EventName": "MISC_RETIRED.LBR_INSERTS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_lowpower"
+    },
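MISC_RETIRED.LBR_INSERTS counts records going into the Last Branch Record stack. From user space, the usual way to have the kernel enable and program the LBRs (via IA32_LBR_CTL on parts that require it, per the description above) is branch-stack sampling. A sketch of the perf_event_attr setup, with the sampled event and period purely illustrative:

  #include <linux/perf_event.h>
  #include <string.h>

  /* Configure a sampling event whose samples carry the LBR branch
   * stack; opening it makes the kernel enable and program the LBRs,
   * whose inserted records MISC_RETIRED.LBR_INSERTS counts. */
  static void lbr_sampling_attr(struct perf_event_attr *attr)
  {
          memset(attr, 0, sizeof(*attr));
          attr->size = sizeof(*attr);
          attr->type = PERF_TYPE_HARDWARE;
          attr->config = PERF_COUNT_HW_CPU_CYCLES;
          attr->sample_period = 100003;
          attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
          attr->branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
                                     PERF_SAMPLE_BRANCH_USER;
  }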
+    {
+        "BriefDescription": "Counts the number of issue slots not consumed by the backend due to a micro-sequencer (MS) scoreboard, which stalls the front-end from issuing from the UROM until a specified older uop retires.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x75",
+        "EventName": "SERIALIZATION.NON_C01_MS_SCB",
+        "PublicDescription": "Counts the number of issue slots not consumed by the backend due to a micro-sequencer (MS) scoreboard, which stalls the front-end from issuing from the UROM until a specified older uop retires. The most commonly executed instruction with an MS scoreboard is PAUSE.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This event counts a subset of the Topdown Slots event that were not consumed by the back-end pipeline due to lack of back-end resources, as a result of memory subsystem delays, execution units limitations, or other conditions.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa4",
+        "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS",
+        "PublicDescription": "This event counts a subset of the Topdown Slots event that were not consumed by the back-end pipeline due to lack of back-end resources, as a result of memory subsystem delays, execution units limitations, or other conditions. Software can use this event as the numerator for the Backend Bound metric (or top-level category) of the Top-down Microarchitecture Analysis method.",
+        "SampleAfterValue": "10000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "TMA slots wasted due to incorrect speculations.",
+        "Counter": "0",
+        "EventCode": "0xa4",
+        "EventName": "TOPDOWN.BAD_SPEC_SLOTS",
+        "PublicDescription": "Number of slots of TMA method that were wasted due to incorrect speculation. It covers all types of control-flow or data-related mis-speculations.",
+        "SampleAfterValue": "10000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "TMA slots wasted due to incorrect speculation by branch mispredictions",
+        "Counter": "0",
+        "EventCode": "0xa4",
+        "EventName": "TOPDOWN.BR_MISPREDICT_SLOTS",
+        "PublicDescription": "Number of TMA slots that were wasted due to incorrect speculation by (any type of) branch mispredictions. This event estimates number of speculative operations that were issued but not retired as well as the out-of-order engine recovery past a branch misprediction.",
+        "SampleAfterValue": "10000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "TOPDOWN.MEMORY_BOUND_SLOTS",
+        "Counter": "3",
+        "EventCode": "0xa4",
+        "EventName": "TOPDOWN.MEMORY_BOUND_SLOTS",
+        "SampleAfterValue": "10000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "TMA slots available for an unhalted logical processor. Fixed counter - architectural event",
+        "Counter": "Fixed counter 3",
+        "EventName": "TOPDOWN.SLOTS",
+        "PublicDescription": "Number of available slots for an unhalted logical processor. The event increments by machine-width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method (TMA). Software can use this event as the denominator for the top-level metrics of the TMA method. This architectural event is counted on a designated fixed counter (Fixed Counter 3).",
+        "SampleAfterValue": "10000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "TMA slots available for an unhalted logical processor. General counter - architectural event",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa4",
+        "EventName": "TOPDOWN.SLOTS_P",
+        "PublicDescription": "Counts the number of available slots for an unhalted logical processor. The event increments by machine-width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method.",
+        "SampleAfterValue": "10000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
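TOPDOWN.SLOTS (or SLOTS_P) is the denominator, and the *_SLOTS events above are numerators, for the TMA level-1 breakdown: for example, Backend Bound = BACKEND_BOUND_SLOTS / SLOTS. A sketch of the arithmetic, assuming the counts were read over the same interval:

  /* TMA level-1 fraction from the slot events above; the other
   * categories (Bad Speculation, Frontend Bound, Retiring) follow the
   * same slots-subset / total-slots pattern and sum to ~1.0. */
  static double backend_bound(unsigned long long backend_bound_slots,
                              unsigned long long slots)
  {
          return (double)backend_bound_slots / (double)slots;
  }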
"TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Branch Mispredict", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MISPREDICT", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Branch Mispredict", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MISPREDICT", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to a machine clear (nuke).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.NUKE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to a machine clear (nuke).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.NUKE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL_P]= ", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to due to certain allocation rest= rictions", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to due to certain allocation rest= rictions", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to memory reservation stall (sche= duler not being able to accept another uop). 
This could be caused by RSV f= ull or load/store buffer block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to memory reservation stall (sche= duler not being able to accept another uop). This could be caused by RSV f= ull or load/store buffer block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to IEC and FPC RAT stalls - which= can be due to the FIQ and IEC reservation station stall (integer, FP and S= IMD scheduler not being able to accept another uop. )", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to IEC and FPC RAT stalls - which= can be due to the FIQ and IEC reservation station stall (integer, FP and S= IMD scheduler not being able to accept another uop. )", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to mrbl stall. A 'marble' refers= to a physical register file entry, also known as the physical destination = (PDST).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REGISTER", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to mrbl stall. 
A 'marble' refers= to a physical register file entry, also known as the physical destination = (PDST).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REGISTER", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to ROB full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REORDER_BUFFER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to ROB full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REORDER_BUFFER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to iq/jeu scoreboards or ms scb", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.SERIALIZATION", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to iq/jeu scoreboards or ms scb", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.SERIALIZATION", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of retiremen= t slots not consumed due to front end stalls.", + "Counter": "37", + "EventName": "TOPDOWN_FE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x9c", + "EventName": "TOPDOWN_FE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BAClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BAClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BTClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BTClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": 
"TOPDOWN_FE_BOUND.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to ms", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.CISC", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to ms", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.CISC", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to decode stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to decode stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to frontend bandwidth restricti= ons due to decode, predecode, cisc, and other limitations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH", + "SampleAfterValue": "1000003", + "UMask": "0x8d", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to frontend bandwidth restricti= ons due to decode, predecode, cisc, and other limitations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH", + "SampleAfterValue": "1000003", + "UMask": "0x8d", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to latency related stalls inclu= ding BACLEARs, BTCLEARs, ITLB misses, and ICache misses.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x72", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to latency related stalls inclu= ding BACLEARs, BTCLEARs, ITLB misses, and ICache misses.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x72", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "This event is deprecated. 
[This event is alia= s to TOPDOWN_FE_BOUND.ITLB_MISS]", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to itlb miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to itlb miss [This event is ali= as to TOPDOWN_FE_BOUND.ITLB]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend that do not categorize into any oth= er common frontend stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend that do not categorize into any oth= er common frontend stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of consumed = retirement slots.", + "Counter": "38", + "EventName": "TOPDOWN_RETIRING.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of consumed retirement slot= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "TOPDOWN_RETIRING.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of consumed retirement slot= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x72", + "EventName": "TOPDOWN_RETIRING.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Number of non dec-by-all uops decoded by deco= der", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x76", + "EventName": "UOPS_DECODED.DEC0_UOPS", + "PublicDescription": "This event counts the number of not dec-by-a= ll uops decoded by decoder 0.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on INT EU ALU ports.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.ALU", + "PublicDescription": "Number of ALU 
integer uops dispatch to execu= tion.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on any INT EU ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.INT_EU_ALL", + "PublicDescription": "Number of integer uops dispatched to executi= on.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops dispatched/executed by any of = the 3 JEUs (all ups that hold the JEU including macro; micro jumps; fetch-f= rom-eip)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.JMP", + "PublicDescription": "Number of jump uops dispatch to execution", + "SampleAfterValue": "2000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on Load ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.LOAD", + "PublicDescription": "Number of Load uops dispatched to execution.= ", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of (shift) 1-cycle Uops dispatched/exe= cuted by any of the Shift Eus", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SHIFT", + "PublicDescription": "Number of SHIFT integer uops dispatch to exe= cution", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops dispatched/executed by Slow EU= (e.g. 3+ cycles LEA, >1 cycles shift, iDIVs, CR; *H operation)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SLOW", + "PublicDescription": "Number of Slow integer uops dispatch to exec= ution.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops dispatched on STA ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STA", + "PublicDescription": "Number of STA (Store Address) uops dispatch = to execution", + "SampleAfterValue": "2000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops executed on STD ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STD", + "PublicDescription": "Number of STD (Store Data) uops dispatch to = execution", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 1 uop was executed per-= thread", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_1", + "PublicDescription": "Cycles where at least 1 uop was executed per= -thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 2 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "2", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_2", + "PublicDescription": "Cycles where at least 2 uops were executed p= er-thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 3 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_3", + "PublicDescription": "Cycles where at least 3 uops were executed p= er-thread.", + "SampleAfterValue": "2000003", + 
"UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 4 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "4", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_4", + "PublicDescription": "Cycles where at least 4 uops were executed p= er-thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts number of cycles no uops were dispatch= ed to be executed on this thread.", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.STALLS", + "Invert": "1", + "PublicDescription": "Counts cycles during which no uops were disp= atched from the Reservation Station (RS) per thread.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of uops to be executed per-= thread each cycle.", + "Counter": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.THREAD", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of x87 uops dispatched.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.X87", + "PublicDescription": "Counts the number of x87 uops executed.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "When 4-uops are requested and only 2-uops are= delivered, the event counts 2. Uops_issued correlates to the number of RO= B entries. If uop takes 2 ROB slots it counts as 2 uops_issued.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x0e", + "EventName": "UOPS_ISSUED.ANY", + "SampleAfterValue": "200003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops that RAT issues to RS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.ANY", + "PublicDescription": "Counts the number of uops that the Resource = Allocation Table (RAT) issues to the Reservation Station (RS).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of uops issued by the front= end every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x0e", + "EventName": "UOPS_ISSUED.ANY", + "PublicDescription": "Counts the number of uops issued by the fron= t end every cycle. When 4-uops are requested and only 2-uops are delivered,= the event counts 2. Uops_issued correlates to the number of ROB entries. 
= If uop takes 2 ROB slots it counts as 2 uops_issued.", + "SampleAfterValue": "1000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "UOPS_ISSUED.CYCLES", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.CYCLES", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of uops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.ALL", + "SampleAfterValue": "2000003", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Cycles with retired uop(s).", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.CYCLES", + "PublicDescription": "Counts cycles where at least one uop has ret= ired.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired uops except the last uop of each inst= ruction.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.HEAVY", + "PublicDescription": "Counts the number of retired micro-operation= s (uops) except the last uop of each instruction. An instruction that is de= coded into less than two uops does not contribute to the count.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of integer divide uops reti= red", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.IDIV", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of integer divide uops reti= red.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.IDIV", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Counts the number of uops that are from the c= omplex flows issued by the micro-sequencer (MS). This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "UOPS_RETIRED.MS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of uops that are from the c= omplex flows issued by the micro-sequencer (MS). This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_lowpower" + }, + { + "BriefDescription": "Number of non-speculative switches to the Mic= rocode Sequencer (MS)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS_SWITCHES", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "PublicDescription": "Switches to the Microcode Sequencer", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that are utilized by operations that eventually get retired (commi= tted) by the processor pipeline. 
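A note for reviewers on the aggregate TOPDOWN_FE_BOUND umasks above: FRONTEND_BANDWIDTH (0x8d) and FRONTEND_LATENCY (0x72) are the bitwise OR of their leaf events' umasks. A minimal Python sketch of that check follows; the 0x2 (BRANCH_DETECT) and 0x20 (ICACHE) latency leaves fall outside this hunk and are assumed here:

  # Leaf umasks from the hunk above; BRANCH_DETECT/ICACHE values are assumed.
  bandwidth = {"CISC": 0x1, "PREDECODE": 0x4, "DECODE": 0x8, "OTHER": 0x80}
  latency = {"BRANCH_DETECT": 0x2, "ITLB_MISS": 0x10, "ICACHE": 0x20,
             "BRANCH_RESTEER": 0x40}

  def union(leaves):
      mask = 0
      for m in leaves.values():
          mask |= m  # the summary event fires for any of its leaf conditions
      return mask

  assert union(bandwidth) == 0x8D  # TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH
  assert union(latency) == 0x72    # TOPDOWN_FE_BOUND.FRONTEND_LATENCY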
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.json b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.json
new file mode 100644
index 000000000000..f294852dfbe6
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-cache.json
@@ -0,0 +1,20 @@
+[
+    {
+        "BriefDescription": "Number of all entries allocated. Includes also retries.",
+        "Counter": "0,1",
+        "EventCode": "0x35",
+        "EventName": "UNC_HAC_CBO_TOR_ALLOCATION.ALL",
+        "PerPkg": "1",
+        "UMask": "0x8",
+        "Unit": "HAC_CBO"
+    },
+    {
+        "BriefDescription": "Asserted on coherent DRD + DRdPref allocations into the queue. Cacheable only",
+        "Counter": "0,1",
+        "EventCode": "0x35",
+        "EventName": "UNC_HAC_CBO_TOR_ALLOCATION.DRD",
+        "PerPkg": "1",
+        "UMask": "0x1",
+        "Unit": "HAC_CBO"
+    }
+]
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-interconnect.json
new file mode 100644
index 000000000000..15971f0c71db
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-interconnect.json
@@ -0,0 +1,47 @@
+[
+    {
+        "BriefDescription": "Number of all coherent Data Read entries. Doesn't include prefetches",
+        "Counter": "0,1",
+        "EventCode": "0x81",
+        "EventName": "UNC_HAC_ARB_REQ_TRK_REQUEST.DRD",
+        "PerPkg": "1",
+        "UMask": "0x2",
+        "Unit": "HAC_ARB"
+    },
+    {
+        "BriefDescription": "Number of all CMI transactions",
+        "Counter": "0,1",
+        "EventCode": "0x8A",
+        "EventName": "UNC_HAC_ARB_TRANSACTIONS.ALL",
+        "PerPkg": "1",
+        "UMask": "0x1",
+        "Unit": "HAC_ARB"
+    },
+    {
+        "BriefDescription": "Number of all CMI reads",
+        "Counter": "0,1",
+        "EventCode": "0x8A",
+        "EventName": "UNC_HAC_ARB_TRANSACTIONS.READS",
+        "PerPkg": "1",
+        "UMask": "0x2",
+        "Unit": "HAC_ARB"
+    },
+    {
+        "BriefDescription": "Number of all CMI writes not including Mflush",
+        "Counter": "0,1",
+        "EventCode": "0x8A",
+        "EventName": "UNC_HAC_ARB_TRANSACTIONS.WRITES",
+        "PerPkg": "1",
+        "UMask": "0x4",
+        "Unit": "HAC_ARB"
+    },
+    {
+        "BriefDescription": "Total number of all outgoing entries allocated. Accounts for Coherent and non-coherent traffic.",
+        "Counter": "0,1",
+        "EventCode": "0x81",
+        "EventName": "UNC_HAC_ARB_TRK_REQUESTS.ALL",
+        "PerPkg": "1",
+        "UMask": "0x1",
+        "Unit": "HAC_ARB"
+    }
+]
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.json b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.json
new file mode 100644
index 000000000000..783a4f7fd05b
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-memory.json
@@ -0,0 +1,142 @@
+[
+    {
+        "BriefDescription": "Counts every CAS read command sent from the Memory Controller 0 to DRAM (sum of all channels).",
+        "Counter": "0",
+        "EventCode": "0xff",
+        "EventName": "UNC_MC0_RDCAS_COUNT_FREERUN",
+        "PerPkg": "1",
+        "PublicDescription": "Counts every CAS read command sent from the Memory Controller 0 to DRAM (sum of all channels). Each CAS commands can be for 32B or 64B of data.",
+        "UMask": "0x20",
+        "Unit": "imc_free_running_0"
+    },
+    {
+        "BriefDescription": "Counts every read and write request entering the Memory Controller 0.",
+        "Counter": "2",
+        "EventCode": "0xff",
+        "EventName": "UNC_MC0_TOTAL_REQCOUNT_FREERUN",
+        "PerPkg": "1",
+        "PublicDescription": "Counts every read and write request entering the Memory Controller 0 (sum of all channels). All requests are counted as one, whether they are 32B or 64B Read/Write or partial/full line writes. Some write requests to the same address may merge to a single write command to DRAM. Therefore, the total request count may be higher than total DRAM BW.",
+        "UMask": "0x10",
+        "Unit": "imc_free_running_0"
+    },
+    {
+        "BriefDescription": "Counts every CAS write command sent from the Memory Controller 0 to DRAM (sum of all channels).",
+        "Counter": "1",
+        "EventCode": "0xff",
+        "EventName": "UNC_MC0_WRCAS_COUNT_FREERUN",
+        "PerPkg": "1",
+        "PublicDescription": "Counts every CAS write command sent from the Memory Controller 0 to DRAM (sum of all channels). Each CAS commands can be for 32B or 64B of data.",
+        "UMask": "0x30",
+        "Unit": "imc_free_running_0"
+    },
+    {
+        "BriefDescription": "Counts every CAS read command sent from the Memory Controller 1 to DRAM (sum of all channels).",
+        "Counter": "3",
+        "EventCode": "0xff",
+        "EventName": "UNC_MC1_RDCAS_COUNT_FREERUN",
+        "PerPkg": "1",
+        "PublicDescription": "Counts every CAS read command sent from the Memory Controller 1 to DRAM (sum of all channels). Each CAS commands can be for 32B or 64B of data.",
+        "UMask": "0x20",
+        "Unit": "imc_free_running_1"
+    },
+    {
+        "BriefDescription": "Counts every read and write request entering the Memory Controller 1.",
+        "Counter": "5",
+        "EventCode": "0xff",
+        "EventName": "UNC_MC1_TOTAL_REQCOUNT_FREERUN",
+        "PerPkg": "1",
+        "PublicDescription": "Counts every read and write request entering the Memory Controller 1 (sum of all channels). All requests are counted as one, whether they are 32B or 64B Read/Write or partial/full line writes. Some write requests to the same address may merge to a single write command to DRAM. Therefore, the total request count may be higher than total DRAM BW.",
+        "UMask": "0x10",
+        "Unit": "imc_free_running_1"
+    },
+    {
+        "BriefDescription": "Counts every CAS write command sent from the Memory Controller 1 to DRAM (sum of all channels).",
+        "Counter": "4",
+        "EventCode": "0xff",
+        "EventName": "UNC_MC1_WRCAS_COUNT_FREERUN",
+        "PerPkg": "1",
+        "PublicDescription": "Counts every CAS write command sent from the Memory Controller 1 to DRAM (sum of all channels). Each CAS commands can be for 32B or 64B of data.",
+        "UMask": "0x30",
+        "Unit": "imc_free_running_1"
+    },
+    {
+        "BriefDescription": "ACT command for a read request sent to DRAM",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x24",
+        "EventName": "UNC_M_ACT_COUNT_RD",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "ACT command sent to DRAM",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x26",
+        "EventName": "UNC_M_ACT_COUNT_TOTAL",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "ACT command for a write request sent to DRAM",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x25",
+        "EventName": "UNC_M_ACT_COUNT_WR",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "Read CAS command sent to DRAM",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x22",
+        "EventName": "UNC_M_CAS_COUNT_RD",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "Write CAS command sent to DRAM",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x23",
+        "EventName": "UNC_M_CAS_COUNT_WR",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "PRE command sent to DRAM due to page table idle timer expiration",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x28",
+        "EventName": "UNC_M_PRE_COUNT_IDLE",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "PRE command sent to DRAM for a read/write request",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x27",
+        "EventName": "UNC_M_PRE_COUNT_PAGE_MISS",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "Number of bytes read from DRAM, in 32B chunks. Counter increments by 1 after receiving 32B chunk data.",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x3A",
+        "EventName": "UNC_M_RD_DATA",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "Total number of read and write byte transfers to/from DRAM, in 32B chunks. Counter increments by 1 after sending or receiving 32B chunk data.",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x3C",
+        "EventName": "UNC_M_TOTAL_DATA",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    },
+    {
+        "BriefDescription": "Number of bytes written to DRAM, in 32B chunks. Counter increments by 1 after sending 32B chunk data.",
+        "Counter": "0,1,2,3,4",
+        "EventCode": "0x3B",
+        "EventName": "UNC_M_WR_DATA",
+        "PerPkg": "1",
+        "Unit": "iMC"
+    }
+]
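Since the UNC_M_*_DATA events above tick once per 32B chunk, DRAM bandwidth falls out directly from a counter delta and an interval. A minimal Python sketch; the delta and interval are made-up example values:

  CHUNK_BYTES = 32  # UNC_M_RD_DATA/UNC_M_WR_DATA/UNC_M_TOTAL_DATA granularity

  def dram_mbytes_per_sec(total_data_delta, seconds):
      # Each increment of the counter represents one 32B transfer.
      return total_data_delta * CHUNK_BYTES / seconds / 1e6

  print(dram_mbytes_per_sec(1_250_000_000, 1.0))  # 40000.0 MB/s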
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/uncore-other.json b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-other.json
new file mode 100644
index 000000000000..1ac5b5ef8094
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/uncore-other.json
@@ -0,0 +1,10 @@
+[
+    {
+        "BriefDescription": "This 48-bit fixed counter counts the UCLK cycles.",
+        "Counter": "FIXED",
+        "EventCode": "0xff",
+        "EventName": "UNC_CLOCK.SOCKET",
+        "PerPkg": "1",
+        "Unit": "CLOCK"
+    }
+]
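UNC_CLOCK.SOCKET gives the uncore a time base: a delta divided by wall time yields the average uncore frequency. A minimal sketch, masking the delta because the description says the fixed counter is 48 bits wide (sample values are hypothetical):

  MASK48 = (1 << 48) - 1  # UNC_CLOCK.SOCKET is a 48-bit free-running counter

  def avg_uclk_ghz(start, end, seconds):
      return ((end - start) & MASK48) / seconds / 1e9

  print(avg_uclk_ghz(0, 3_800_000_000, 1.0))  # 3.8 (GHz)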
diff --git a/tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory.json b/tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory.json
new file mode 100644
index 000000000000..a3e4a4f3ab45
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/arrowlake/virtual-memory.json
@@ -0,0 +1,522 @@
+[
+    {
+        "BriefDescription": "Counts the number of page walks initiated by a demand load that missed the first and second level TLBs.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.MISS_CAUSED_WALK",
+        "SampleAfterValue": "200003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of first level TLB misses but second level hits due to a demand load that did not start a page walk. Accounts for all page sizes. Will result in a DTLB write from STLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.STLB_HIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Loads that miss the DTLB and hit the STLB.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x12",
+        "EventName": "DTLB_LOAD_MISSES.STLB_HIT",
+        "PublicDescription": "Counts loads that miss the DTLB (Data TLB) and hit the STLB (Second level TLB).",
+        "SampleAfterValue": "100003",
+        "UMask": "0x320",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of first level TLB misses but second level hits due to a demand load that did not start a page walk. Accounts for all page sizes. Will result in a DTLB write from STLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.STLB_HIT",
+        "SampleAfterValue": "200003",
+        "UMask": "0x20",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Cycles when at least one PMH is busy with a page walk for a demand load.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0x12",
+        "EventName": "DTLB_LOAD_MISSES.WALK_ACTIVE",
+        "PublicDescription": "Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a demand load.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Load miss in all TLB levels causes a page walk that completes. (All page sizes)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x12",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED",
+        "PublicDescription": "Counts completed page walks (all page sizes) caused by demand data loads. This implies it missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0xe",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to load DTLB misses.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED",
+        "SampleAfterValue": "200003",
+        "UMask": "0xe",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Page walks completed due to a demand data load to a 1G page.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x12",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G",
+        "PublicDescription": "Counts completed page walks (1G sizes) caused by demand data loads. This implies address translations missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to load DTLB misses to a 2M or 4M page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts the number of page walks completed due to loads (including SW prefetches) whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks that page fault.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Page walks completed due to a demand data load to a 2M/4M page.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x12",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts completed page walks (2M/4M sizes) caused by demand data loads. This implies address translations missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to load DTLB misses to a 2M or 4M page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts the number of page walks completed due to loads (including SW prefetches) whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks that page fault.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to load DTLB misses to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts the number of page walks completed due to loads (including SW prefetches) whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that page fault.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Page walks completed due to a demand data load to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x12",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts completed page walks (4K sizes) caused by demand data loads. This implies address translations missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to load DTLB misses to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts the number of page walks completed due to loads (including SW prefetches) whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that page fault.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x2",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of page walks outstanding for a demand load in the PMH each cycle.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x12",
+        "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for a demand load in the PMH (Page Miss Handler) each cycle.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x08",
+        "EventName": "DTLB_LOAD_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for Loads (demand or SW prefetch) in PMH every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks initiated by a store that missed the first and second level TLBs.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.MISS_CAUSED_WALK",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of first level TLB misses but second level hits due to stores that did not start a page walk. Accounts for all page sizes. Will result in a DTLB write from STLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.STLB_HIT",
+        "PublicDescription": "Counts the number of first level TLB misses but second level hits due to a demand load that did not start a page walk. Accounts for all page sizes. Will result in a DTLB write from STLB.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Stores that miss the DTLB and hit the STLB.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.STLB_HIT",
+        "PublicDescription": "Counts stores that miss the DTLB (Data TLB) and hit the STLB (2nd Level TLB).",
+        "SampleAfterValue": "100003",
+        "UMask": "0x320",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of first level TLB misses but second level hits due to stores that did not start a page walk. Accounts for all pages sizes. Will result in a DTLB write from STLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.STLB_HIT",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x20",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Cycles when at least one PMH is busy with a page walk for a store.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.WALK_ACTIVE",
+        "PublicDescription": "Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a store.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Store misses in all TLB levels causes a page walk that completes. (All page sizes)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED",
+        "PublicDescription": "Counts completed page walks (all page sizes) caused by demand data stores. This implies it missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0xe",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to store DTLB misses to a 1G page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED",
+        "SampleAfterValue": "2000003",
+        "UMask": "0xe",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Page walks completed due to a demand data store to a 1G page.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G",
+        "PublicDescription": "Counts completed page walks (1G sizes) caused by demand data stores. This implies address translations missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Page walks completed due to a demand data store to a 2M/4M page.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts completed page walks (2M/4M sizes) caused by demand data stores. This implies address translations missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to store DTLB misses to a 2M or 4M page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts the number of page walks completed due to stores whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x4",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to store DTLB misses to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts the number of page walks completed due to stores whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Page walks completed due to a demand data store to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts completed page walks (4K sizes) caused by demand data stores. This implies address translations missed in the DTLB and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to store DTLB misses to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts the number of page walks completed due to stores whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks outstanding in the page miss handler (PMH) for stores every cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding in the page miss handler (PMH) for stores every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of page walks outstanding for a store in the PMH each cycle.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for a store in the PMH (Page Miss Handler) each cycle.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks outstanding in the page miss handler (PMH) for stores every cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x49",
+        "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding in the page miss handler (PMH) for stores every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks initiated by a instruction fetch that missed the first and second level TLBs.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.MISS_CAUSED_WALK",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks initiated by a instruction fetch that missed the first and second level TLBs.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.MISS_CAUSED_WALK",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of first level TLB misses but second level hits due to an instruction fetch that did not start a page walk. Account for all pages sizes. Will result in an ITLB write from STLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.STLB_HIT",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instruction fetch requests that miss the ITLB and hit the STLB.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x11",
+        "EventName": "ITLB_MISSES.STLB_HIT",
+        "PublicDescription": "Counts instruction fetch requests that miss the ITLB (Instruction TLB) and hit the STLB (Second-level TLB).",
+        "SampleAfterValue": "100003",
+        "UMask": "0x120",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of first level TLB misses but second level hits due to an instruction fetch that did not start a page walk. Account for all pages sizes. Will result in an ITLB write from STLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.STLB_HIT",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x20",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Cycles when at least one PMH is busy with a page walk for code (instruction fetch) request.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0x11",
+        "EventName": "ITLB_MISSES.WALK_ACTIVE",
+        "PublicDescription": "Counts cycles when at least one PMH (Page Miss Handler) is busy with a page walk for a code (instruction fetch) request.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to instruction fetch misses to any page size.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED",
+        "PublicDescription": "Counts the number of page walks completed due to instruction fetches whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to any page size. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0xe",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Code miss in all TLB levels causes a page walk that completes. (All page sizes)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x11",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED",
+        "PublicDescription": "Counts completed page walks (all page sizes) caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0xe",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to instruction fetch misses to any page size.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED",
+        "PublicDescription": "Counts the number of page walks completed due to instruction fetches whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to any page size. Includes page walks that page fault.",
+        "SampleAfterValue": "200003",
+        "UMask": "0xe",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to instruction fetch misses to a 2M or 4M page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts the number of page walks completed due to instruction fetches whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Code miss in all TLB levels causes a page walk that completes. (2M/4M)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x11",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts completed page walks (2M/4M page sizes) caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to instruction fetch misses to a 2M or 4M page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M",
+        "PublicDescription": "Counts the number of page walks completed due to instruction fetches whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x4",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to instruction fetch misses to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts the number of page walks completed due to instruction fetches whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Code miss in all TLB levels causes a page walk that completes. (4K)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x11",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts completed page walks (4K page sizes) caused by a code fetch. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB. The page walk can end with or without a fault.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks completed due to instruction fetch misses to a 4K page.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_COMPLETED_4K",
+        "PublicDescription": "Counts the number of page walks completed due to instruction fetches whose address translations missed in all Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that page fault.",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks outstanding for iside in PMH every cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for iside in PMH every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals. Walks could be counted by edge detecting on this event, but would count restarted suspended walks.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of page walks outstanding for an outstanding code request in the PMH each cycle.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x11",
+        "EventName": "ITLB_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for an outstanding code (instruction fetch) request in the PMH (Page Miss Handler) each cycle.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks outstanding for iside in PMH every cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for iside in PMH every cycle. A PMH page walk is outstanding from page walk start till PMH becomes idle again (ready to serve next walk). Includes EPT-walk intervals. Walks could be counted by edge detecting on this event, but would count restarted suspended walks.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_lowpower"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a DTLB miss.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.DTLB_MISS_AT_RET",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x90",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a DTLB miss.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.DTLB_MISS_AT_RET",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x90",
+        "Unit": "cpu_lowpower"
+    }
+]
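The WALK_PENDING/WALK_COMPLETED pairs above compose into a useful derived metric: walk-cycles outstanding per completed walk approximates the average page-walk latency (note WALK_ACTIVE is the same 0x10 umask qualified with CounterMask 1). A minimal Python sketch with hypothetical counter deltas:

  def avg_walk_cycles(walk_pending, walk_completed):
      # pending counts outstanding walks each cycle; completed counts walks
      return walk_pending / walk_completed if walk_completed else 0.0

  print(avg_walk_cycles(walk_pending=2_600_000, walk_completed=100_000))  # 26.0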
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 2fa5529ee585..3cae9f55bd9f 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -1,6 +1,7 @@
 Family-model,Version,Filename,EventType
 GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core
 GenuineIntel-6-BE,v1.28,alderlaken,core
+GenuineIntel-6-C[56],v1.06,arrowlake,core
 GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
 GenuineIntel-6-(3D|47),v29,broadwell,core
 GenuineIntel-6-56,v11,broadwellde,core
-- 
2.48.0.rc2.279.g1de40edade-goog
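For the mapfile.csv hunk above: the first column is a regular expression matched against the CPUID family-model string, so the new row covers the Arrow Lake models 0xC5 and 0xC6 and nothing else. A minimal Python sketch of the matching, with re.fullmatch standing in for perf's anchored regex match:

  import re

  ROW = "GenuineIntel-6-C[56]"
  for fm in ("GenuineIntel-6-C5", "GenuineIntel-6-C6", "GenuineIntel-6-BE"):
      print(fm, bool(re.fullmatch(ROW, fm)))  # True, True, False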
From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:36 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-5-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 04/23] perf vendor events: Update Broadwell events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo ,
 Namhyung Kim , Mark Rutland , Alexander Shishkin ,
 Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang ,
 Andreas Färber , Manivannan Sadhasivam , Weilin Wang ,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor , Samantha Alt , Caleb Biggers ,
 Edward Baker , Michael Petlan 

Update events from v29 to v30. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v30:
https://github.com/intel/perfmon/commit/9a1827b2ac3927a455ae7df5aa3d1e1b10e69f15

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers 
Signed-off-by: Caleb Biggers 
Signed-off-by: Ian Rogers 
Acked-by: Kan Liang 
---
 .../arch/x86/broadwell/bdw-metrics.json       | 312 ++++++++++--------
 .../pmu-events/arch/x86/broadwell/cache.json  |  10 +-
 .../arch/x86/broadwell/frontend.json          |   4 +-
 .../pmu-events/arch/x86/broadwell/memory.json |   8 +-
 .../arch/x86/broadwell/metricgroups.json      |   5 +
 .../arch/x86/broadwell/pipeline.json          |  10 +-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 7 files changed, 192 insertions(+), 159 deletions(-)
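Most of the MetricThreshold churn in the diff below only drops redundant parentheses. That is safe because, in the metric expression grammar, comparison binds tighter than '&' (as in C) and logical AND is associative, so the nested and flat forms evaluate identically. A minimal Python sketch modelling '&' with 'and' on arbitrary sample values:

  a4k, l1b, memb, beb = 0.25, 0.15, 0.3, 0.3  # hypothetical metric values

  nested = a4k > 0.2 and (l1b > 0.1 and (memb > 0.2 and beb > 0.2))
  flat = a4k > 0.2 and l1b > 0.1 and memb > 0.2 and beb > 0.2
  assert nested == flat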
diff --git a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
index af620553f958..40970fa5566c 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/bdw-metrics.json
@@ -74,12 +74,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
@@ -92,8 +92,8 @@
         "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY_WB_ASSIST",
         "ScaleUnit": "100%"
@@ -104,7 +104,7 @@
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound",
         "ScaleUnit": "100%"
     },
     {
@@ -114,7 +114,7 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
     {
@@ -125,7 +125,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
     {
@@ -133,8 +133,8 @@
         "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -143,8 +143,8 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
         "ScaleUnit": "100%"
     },
     {
@@ -152,7 +152,7 @@
         "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -160,10 +160,10 @@
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
@@ -174,7 +174,7 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g.
FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { @@ -183,8 +183,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -192,8 +192,8 @@ "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.FPU_DIV_ACTIVE", "ScaleUnit": "100%" }, { @@ -202,8 +202,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", "ScaleUnit": "100%" }, { @@ -212,7 +212,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -220,45 +220,45 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. 
Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tm= a_info_thread_clks", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / = tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) /= tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED)= / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. 
Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM = / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Rela= ted metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). 
Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", "ScaleUnit": "100%" }, { @@ -287,7 +287,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -295,8 +295,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", "ScaleUnit": "100%" }, { @@ -304,8 +304,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. 
Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -313,8 +313,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -322,8 +322,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -333,33 +333,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. 
For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. 
This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -370,7 +370,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -391,11 +391,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -414,7 +414,13 @@ "MetricName": "tma_info_frontend_ipunknown_branch" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -432,7 +438,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -440,7 +446,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -448,7 +454,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -456,7 +462,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -464,7 +470,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -506,7 +512,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -529,7 +535,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -541,7 +547,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -583,7 +589,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -602,7 +608,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -628,26 +634,26 @@ }, { "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=3D1@ + cpu= @DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=3D1@ + cpu@DTLB_STORE_MISSES.WALK= _DURATION\\,cmask\\=3D1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOA= D_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_core= _clks", + "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=3D0x1@ + c= pu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=3D0x1@ + cpu@DTLB_STORE_MISSES.= WALK_DURATION\\,cmask\\=3D0x1@ + 7 
* (DTLB_STORE_MISSES.WALK_COMPLETED + DT= LB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_cor= e_core_clks", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_page_walks_utilization", "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D0x1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -665,14 +671,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -682,13 +688,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -697,6 +704,19 @@ "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -709,6 +729,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -716,7 +743,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -725,14 +752,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -758,24 +786,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_= clks", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_threa= d_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. 
Related metri= cs: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_m= s_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -783,8 +811,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.ST= ALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -793,8 +821,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_M= ISS / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -803,8 +831,8 @@ "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RET= IRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metrics: tma_mem_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: tma_branch_resteers, tma_mem_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -812,18 +840,18 @@
         "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
         "MetricExpr": "tma_retiring - tma_heavy_operations",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
         "MetricThreshold": "tma_light_operations > 0.6",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
     {
@@ -840,10 +868,10 @@
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -854,15 +882,15 @@
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
-        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
@@ -871,7 +899,7 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
@@ -883,7 +911,7 @@
         "MetricName": "tma_memory_bound",
         "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
         "ScaleUnit": "100%"
     },
     {
@@ -900,8 +928,8 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Related metrics: tma_branch_mispredicts",
         "ScaleUnit": "100%"
     },
     {
@@ -910,7 +938,7 @@
         "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_mite",
         "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck",
         "ScaleUnit": "100%"
     },
     {
@@ -918,8 +946,8 @@
         "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer",
         "ScaleUnit": "100%"
     },
     {
@@ -928,7 +956,7 @@
         "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_0",
         "MetricThreshold": "tma_port_0 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -937,7 +965,7 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_1",
         "MetricThreshold": "tma_port_1 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -973,7 +1001,7 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_5",
         "MetricThreshold": "tma_port_5 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+] ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -982,7 +1010,7 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_6",
         "MetricThreshold": "tma_port_6 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1000,43 +1028,43 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_ports_utilization",
-        "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
-        "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=1@ / 2 if #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)",
+        "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=0x1\\,cmask\\=0x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_0",
-        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
-        "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks)",
+        "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_1",
-        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Related metrics: tma_l1_bound",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
-        "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks)",
+        "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_2",
-        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise).",
-        "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks",
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks",
         "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_3m",
-        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1055,8 +1083,8 @@
         "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_split_loads",
-        "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS",
         "ScaleUnit": "100%"
     },
     {
@@ -1064,16 +1092,16 @@
         "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
         "MetricName": "tma_split_stores",
-        "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. Related metrics: tma_port_4",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
         "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on else OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
         "MetricName": "tma_sq_full",
-        "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
         "ScaleUnit": "100%"
     },
@@ -1082,8 +1110,8 @@
         "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_store_bound",
-        "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES_PS",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_UOPS_RETIRED.ALL_STORES",
         "ScaleUnit": "100%"
     },
     {
@@ -1091,18 +1119,18 @@
         "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_store_fwd_blk",
-        "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
-        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
         "MetricName": "tma_store_latency",
-        "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1118,7 +1146,7 @@
         "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tma_clears_resteers",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
         "MetricName": "tma_unknown_branches",
-        "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: BACLEARS.ANY",
         "ScaleUnit": "100%"
     },
@@ -1127,8 +1155,8 @@
         "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
         "MetricName": "tma_x87_use",
-        "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%"
     }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/broadwell/cache.json b/tools/perf/pmu-events/arch/x86/broadwell/cache.json
index 063ec8c2b2a1..fd3df87318c5 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/cache.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/cache.json
@@ -22,7 +22,7 @@
         "Counter": "2",
         "EventCode": "0x48",
         "EventName": "L1D_PEND_MISS.PENDING",
-        "PublicDescription": "This event counts duration of L1D miss outstanding, that is each cycle number of Fill Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand; from the demand Hit FB, if it is allocated by hardware or software prefetch.\nNote: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type.",
+        "PublicDescription": "This event counts duration of L1D miss outstanding, that is each cycle number of Fill Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand; from the demand Hit FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -401,7 +401,7 @@
         "EventCode": "0xD1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.HIT_LFB",
         "PEBS": "1",
-        "PublicDescription": "This event counts retired load uops which data sources were load uops missed L1 but hit a fill buffer due to a preceding miss to the same cache line with the data not ready.\nNote: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load.",
+        "PublicDescription": "This event counts retired load uops which data sources were load uops missed L1 but hit a fill buffer due to a preceding miss to the same cache line with the data not ready. Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load.",
         "SampleAfterValue": "100003",
         "UMask": "0x40"
     },
@@ -412,7 +412,7 @@
         "EventCode": "0xD1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT",
         "PEBS": "1",
-        "PublicDescription": "This event counts retired load uops which data sources were hits in the nearest-level (L1) cache.\nNote: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load. This event also counts SW prefetches independent of the actual data source.",
+        "PublicDescription": "This event counts retired load uops which data sources were hits in the nearest-level (L1) cache. Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load. This event also counts SW prefetches independent of the actual data source.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -601,7 +601,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xb2",
         "EventName": "OFFCORE_REQUESTS_BUFFER.SQ_FULL",
-        "PublicDescription": "This event counts the number of cases when the offcore requests buffer cannot take more entries for the core. This can happen when the superqueue does not contain eligible entries, or when L1D writeback pending FIFO requests is full.\nNote: Writeback pending FIFO has six entries.",
+        "PublicDescription": "This event counts the number of cases when the offcore requests buffer cannot take more entries for the core. This can happen when the superqueue does not contain eligible entries, or when L1D writeback pending FIFO requests is full. Note: Writeback pending FIFO has six entries.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -664,7 +664,7 @@
         "Errata": "BDM76",
         "EventCode": "0x60",
         "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD",
-        "PublicDescription": "This event counts the number of offcore outstanding Demand Data Read transactions in the super queue (SQ) every cycle. A transaction is considered to be in the Offcore outstanding state between L2 miss and transaction completion sent to requestor. See the corresponding Umask under OFFCORE_REQUESTS.\nNote: A prefetch promoted to Demand is counted from the promotion point.",
+        "PublicDescription": "This event counts the number of offcore outstanding Demand Data Read transactions in the super queue (SQ) every cycle. A transaction is considered to be in the Offcore outstanding state between L2 miss and transaction completion sent to requestor. See the corresponding Umask under OFFCORE_REQUESTS. Note: A prefetch promoted to Demand is counted from the promotion point.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/broadwell/frontend.json b/tools/perf/pmu-events/arch/x86/broadwell/frontend.json
index db3488abf9fc..018020a51436 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/frontend.json
@@ -12,7 +12,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xAB",
         "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES",
-        "PublicDescription": "This event counts Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles. These cycles do not include uops routed through because of the switch itself, for example, when Instruction Decode Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (IDQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receiving the first MITE uop. \nMM is placed before Instruction Decode Queue (IDQ) to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode Stream Buffer (DSB)-to-MITE switch occurs.\nPenalty: A Decode Stream Buffer (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cycles in which no uops are delivered to the IDQ. Most often, such switches from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.",
+        "PublicDescription": "This event counts Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles. These cycles do not include uops routed through because of the switch itself, for example, when Instruction Decode Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (IDQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receiving the first MITE uop. MM is placed before Instruction Decode Queue (IDQ) to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode Stream Buffer (DSB)-to-MITE switch occurs. Penalty: A Decode Stream Buffer (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cycles in which no uops are delivered to the IDQ. Most often, such switches from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.",
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
     },
@@ -212,7 +212,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x9C",
         "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE",
-        "PublicDescription": "This event counts the number of uops not delivered to Resource Allocation Table (RAT) per thread adding 4 x when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3}). Counting does not cover cases when:\n a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread;\n b. Resource Allocation Table (RAT) is stalled for the thread (including uop drops and clear BE conditions); \n c. Instruction Decode Queue (IDQ) delivers four uops.",
+        "PublicDescription": "This event counts the number of uops not delivered to Resource Allocation Table (RAT) per thread adding 4 x when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3}). Counting does not cover cases when: a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread; b. Resource Allocation Table (RAT) is stalled for the thread (including uop drops and clear BE conditions); c. Instruction Decode Queue (IDQ) delivers four uops.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/broadwell/memory.json b/tools/perf/pmu-events/arch/x86/broadwell/memory.json
index 77fbfe99a522..bbc5d7deea6b 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/memory.json
@@ -68,7 +68,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xc8",
         "EventName": "HLE_RETIRED.START",
-        "PublicDescription": "Number of times we entered an HLE region\n does not count nested transactions.",
+        "PublicDescription": "Number of times we entered an HLE region does not count nested transactions.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -77,7 +77,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xC3",
         "EventName": "MACHINE_CLEARS.MEMORY_ORDERING",
-        "PublicDescription": "This event counts the number of memory ordering Machine Clears detected. Memory Ordering Machine Clears can result from one of the following:\n1. memory disambiguation,\n2. external snoop, or\n3. cross SMT-HW-thread snoop (stores) hitting load buffer.",
+        "PublicDescription": "This event counts the number of memory ordering Machine Clears detected. Memory Ordering Machine Clears can result from one of the following: 1. memory disambiguation, 2. external snoop, or 3. cross SMT-HW-thread snoop (stores) hitting load buffer.",
         "SampleAfterValue": "100003",
         "UMask": "0x2"
     },
@@ -2226,7 +2226,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED",
-        "PEBS": "2",
+        "PEBS": "1",
         "PublicDescription": "Number of times RTM abort was triggered .",
         "SampleAfterValue": "2000003",
         "UMask": "0x4"
@@ -2290,7 +2290,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.START",
-        "PublicDescription": "Number of times we entered an RTM region\n does not count nested transactions.",
+        "PublicDescription": "Number of times we entered an RTM region does not count nested transactions.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/broadwell/metricgroups.json b/tools/perf/pmu-events/arch/x86/broadwell/metricgroups.json
index 4193c90c3459..0863375bdead 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/metricgroups.json
@@ -9,6 +9,7 @@
     "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -34,6 +35,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -51,6 +53,7 @@
     "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Power": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -78,6 +81,7 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -103,6 +107,7 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
diff --git a/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json b/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json
index c03f77539362..962cd07eb307 100644
--- a/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/broadwell/pipeline.json
@@ -379,7 +379,7 @@
         "BriefDescription": "Reference cycles when the core is not in halt state.",
         "Counter": "Fixed counter 2",
         "EventName": "CPU_CLK_UNHALTED.REF_TSC",
-        "PublicDescription": "This event counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. \nNote: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. This event is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.",
+        "PublicDescription": "This event counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. Note: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. This event is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.",
         "SampleAfterValue": "2000003",
         "UMask": "0x3"
     },
@@ -579,7 +579,7 @@
         "BriefDescription": "Instructions retired from execution.",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PublicDescription": "This event counts the number of instructions retired from execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. Counting continues during hardware interrupts, traps, and inside interrupt handlers. \nNotes: INST_RETIRED.ANY is counted by a designated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. INST_RETIRED.ANY_P is counted by a programmable counter and it is an architectural performance event. \nCounting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retired instructions.",
+        "PublicDescription": "This event counts the number of instructions retired from execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. INST_RETIRED.ANY_P is counted by a programmable counter and it is an architectural performance event. Counting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retired instructions.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -654,7 +654,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x03",
         "EventName": "LD_BLOCKS.STORE_FORWARD",
-        "PublicDescription": "This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when:\n - preceding store conflicts with the load (incomplete overlap);\n - store forwarding is impossible due to u-arch limitations;\n - preceding lock RMW operations are not forwarded;\n - store has the no-forward bit set (uncacheable/page-split/masked stores);\n - all-blocking stores are used (mostly, fences and port I/O);\nand others.\nThe most common case is a load blocked due to its address range overlapping with a preceding smaller uncompleted store. Note: This event does not take into account cases of out-of-SW-control (for example, SbTailHit), unknown physical STA, and cases of blocking loads on store due to being non-WB memory type or a lock. These cases are covered by other events.\nSee the table of not supported store forwards in the Optimization Guide.",
+        "PublicDescription": "This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when: - preceding store conflicts with the load (incomplete overlap); - store forwarding is impossible due to u-arch limitations; - preceding lock RMW operations are not forwarded; - store has the no-forward bit set (uncacheable/page-split/masked stores); - all-blocking stores are used (mostly, fences and port I/O); and others. The most common case is a load blocked due to its address range overlapping with a preceding smaller uncompleted store. Note: This event does not take into account cases of out-of-SW-control (for example, SbTailHit), unknown physical STA, and cases of blocking loads on store due to being non-WB memory type or a lock. These cases are covered by other events. See the table of not supported store forwards in the Optimization Guide.",
         "SampleAfterValue": "100003",
         "UMask": "0x2"
     },
@@ -822,7 +822,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x5E",
         "EventName": "RS_EVENTS.EMPTY_CYCLES",
-        "PublicDescription": "This event counts cycles during which the reservation station (RS) is empty for the thread.\nNote: In ST-mode, not active thread should drive 0. This is usually caused by severely costly branch mispredictions, or allocator/FE issues.",
+        "PublicDescription": "This event counts cycles during which the reservation station (RS) is empty for the thread. Note: In ST-mode, not active thread should drive 0. This is usually caused by severely costly branch mispredictions, or allocator/FE issues.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -1177,7 +1177,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x0E",
         "EventName": "UOPS_ISSUED.FLAGS_MERGE",
-        "PublicDescription": "Number of flags-merge uops being allocated. Such uops considered perf sensitive\n added by GSR u-arch.",
+        "PublicDescription": "Number of flags-merge uops being allocated. Such uops considered perf sensitive added by GSR u-arch.",
         "SampleAfterValue": "2000003",
         "UMask": "0x10"
     },
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 3cae9f55bd9f..18f3fa1f1dab 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -3,7 +3,7 @@ GenuineIntel-6-(97|9A|B7|BA|BF),v1.28,alderlake,core
 GenuineIntel-6-BE,v1.28,alderlaken,core
 GenuineIntel-6-C[56],v1.06,arrowlake,core
 GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
-GenuineIntel-6-(3D|47),v29,broadwell,core
+GenuineIntel-6-(3D|47),v30,broadwell,core
 GenuineIntel-6-56,v11,broadwellde,core
 GenuineIntel-6-4F,v22,broadwellx,core
 GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core
-- 
2.48.0.rc2.279.g1de40edade-goog
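
For anyone following the series who wants to poke at the regenerated metrics above, the commands below are an illustrative sketch only, not part of the patch. They assume a perf binary built from this tree running on matching hardware; the metric names come from the JSON above, but available metrics and output format vary with CPU model and perf version.

  # Check that an updated metric definition is visible to the new binary:
  $ perf list --details | grep -i tma_lcp
  # Run a couple of the updated TMA metrics system-wide for one second:
  $ perf stat -M tma_lcp,tma_ms_switches -a -- sleep 1

The metric expressions are also parsed by the perf self-tests (e.g. perf test "Parse and process metrics"), so a malformed MetricExpr in regenerated JSON tends to show up there before it shows up at runtime.
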
From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:37 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-6-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 05/23] perf vendor events: Update BroadwellDE events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan
Content-Type: text/plain; charset="utf-8"

Update events from v11 to v12. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v12:
https://github.com/intel/perfmon/commit/e0b83388d545e527933031ddb2a1d22d65040de1
The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
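A note on the mechanical churn in the diff below: the TMA 5.01 generator
emits every MetricThreshold fully parenthesized, so strings like
"tma_assists > 0.1 & (...)" become "(tma_assists > 0.1) & ((...))" with no
change in meaning. In this threshold grammar '&' is a logical AND over the
already-computed metric values, so a child node is only flagged when its
parents are also over their cutoffs. A hypothetical Python rendering of one
such threshold (the sample values are invented, and this is not perf's
metric-expression parser):

    metrics = {
        # Invented sample values; perf derives these from the MetricExpr
        # event formulas at report time.
        "tma_backend_bound": 0.35,
        "tma_memory_bound": 0.25,
        "tma_l1_bound": 0.15,
        "tma_4k_aliasing": 0.22,
    }

    def threshold_4k_aliasing(m):
        # Mirrors "(tma_4k_aliasing > 0.2) & ((tma_l1_bound > 0.1) &
        # ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))".
        return (m["tma_4k_aliasing"] > 0.2) and (m["tma_l1_bound"] > 0.1) \
            and (m["tma_memory_bound"] > 0.2) and (m["tma_backend_bound"] > 0.2)

    print(threshold_4k_aliasing(metrics))  # True: the whole parent chain is hot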
 .../arch/x86/broadwellde/bdwde-metrics.json   | 256 ++++++++++--------
 .../arch/x86/broadwellde/cache.json           |  10 +-
 .../arch/x86/broadwellde/frontend.json        |   4 +-
 .../arch/x86/broadwellde/memory.json          |   6 +-
 .../arch/x86/broadwellde/metricgroups.json    |   5 +
 .../arch/x86/broadwellde/pipeline.json        |  10 +-
 .../arch/x86/broadwellde/uncore-cache.json    |  28 +-
 .../x86/broadwellde/uncore-interconnect.json  |  16 +-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 9 files changed, 184 insertions(+), 153 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
index 2e1380248684..b03a5f2bcd82 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/bdwde-metrics.json
@@ -74,7 +74,7 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_4k_aliasing > 0.2) & ((tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
         "ScaleUnit": "100%"
     },
@@ -84,7 +84,7 @@
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
-        "MetricThreshold": "tma_alu_op_utilization > 0.4",
+        "MetricThreshold": "(tma_alu_op_utilization > 0.4)",
         "ScaleUnit": "100%"
     },
     {
@@ -92,7 +92,7 @@
         "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "(tma_assists > 0.1) & ((tma_microcode_sequencer > 0.05) & ((tma_heavy_operations > 0.1)))",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
@@ -102,7 +102,7 @@
         "MetricExpr": "1 - (tma_frontend_bound + tma_bad_speculation + tma_retiring)",
         "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
-        "MetricThreshold": "tma_backend_bound > 0.2",
+        "MetricThreshold": "(tma_backend_bound > 0.2)",
         "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%"
@@ -112,7 +112,7 @@
         "MetricExpr": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * (INT_MISC.RECOVERY_CYCLES_ANY / 2 if #SMT_on else INT_MISC.RECOVERY_CYCLES)) / tma_info_thread_slots",
         "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
-        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricThreshold": "(tma_bad_speculation > 0.15)",
         "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%"
@@ -123,7 +123,7 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * tma_bad_speculation",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
         "MetricName": "tma_branch_mispredicts",
-        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricThreshold": "(tma_branch_mispredicts > 0.1) & ((tma_bad_speculation > 0.15))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
@@ -133,7 +133,7 @@
         "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_branch_resteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
         "ScaleUnit": "100%"
     },
@@ -143,7 +143,7 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "(tma_cisc > 0.1) & ((tma_microcode_sequencer > 0.05) & ((tma_heavy_operations > 0.1)))",
         "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
         "ScaleUnit": "100%"
     },
@@ -152,7 +152,7 @@
         "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "(tma_clears_resteers > 0.05) & ((tma_branch_resteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -160,9 +160,9 @@
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_contested_accesses > 0.05) & ((tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
@@ -172,7 +172,7 @@
         "MetricExpr": "tma_backend_bound - tma_memory_bound",
         "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_core_bound",
-        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricThreshold": "(tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
         "ScaleUnit": "100%"
@@ -183,7 +183,7 @@
         "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_data_sharing > 0.05) & ((tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
@@ -192,8 +192,8 @@
         "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
+        "MetricThreshold": "(tma_divider > 0.2) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
@@ -202,8 +202,8 @@
         "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "(tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -211,7 +211,7 @@
         "MetricExpr": "(IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
-        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "MetricThreshold": "(tma_dsb > 0.15) & ((tma_fetch_bandwidth > 0.2))",
         "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
         "ScaleUnit": "100%"
     },
@@ -220,7 +220,7 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_dsb_switches > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
@@ -229,7 +229,7 @@
         "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_dtlb_load > 0.1) & ((tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store",
         "ScaleUnit": "100%"
     },
@@ -238,7 +238,7 @@
         "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_dtlb_store > 0.05) & ((tma_store_bound > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load",
         "ScaleUnit": "100%"
     },
@@ -246,9 +246,9 @@
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
-        "MetricThreshold": "tma_fb_full > 0.3",
+        "MetricThreshold": "(tma_fb_full > 0.3)",
         "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%"
     },
@@ -257,9 +257,9 @@
         "MetricExpr": "tma_frontend_bound - tma_fetch_latency",
         "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
         "MetricName": "tma_fetch_bandwidth",
-        "MetricThreshold": "tma_fetch_bandwidth > 0.2",
+        "MetricThreshold": "(tma_fetch_bandwidth > 0.2)",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1;FRONTEND_RETIRED.LATENCY_GE_1;FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -267,7 +267,7 @@
         "MetricExpr": "4 * IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / tma_info_thread_slots",
         "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
         "MetricName": "tma_fetch_latency",
-        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricThreshold": "(tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
         "ScaleUnit": "100%"
@@ -277,7 +277,7 @@
         "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
-        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "MetricThreshold": "(tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))",
         "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
         "ScaleUnit": "100%"
     },
@@ -286,7 +286,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "(tma_fp_scalar > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6)))",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -295,7 +295,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "(tma_fp_vector > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6)))",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -304,8 +304,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_fp_vector_128b > 0.1) & ((tma_fp_vector > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))))",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -313,8 +313,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_fp_vector_256b > 0.1) & ((tma_fp_vector > 0.1) & ((tma_fp_arith > 0.2) & ((tma_light_operations > 0.6))))",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -322,7 +322,7 @@
         "MetricExpr": "IDQ_UOPS_NOT_DELIVERED.CORE / tma_info_thread_slots",
         "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
-        "MetricThreshold": "tma_frontend_bound > 0.15",
+        "MetricThreshold": "(tma_frontend_bound > 0.15)",
         "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%"
@@ -332,9 +332,9 @@
         "MetricExpr": "tma_microcode_sequencer",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
-        "MetricThreshold": "tma_heavy_operations > 0.1",
+        "MetricThreshold": "(tma_heavy_operations > 0.1)",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
         "ScaleUnit": "100%"
     },
     {
@@ -342,23 +342,23 @@
         "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_icache_misses > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "(tma_info_bad_spec_ipmisp_indirect < 1000)"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Bad;BadSpec;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmispredict",
-        "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
+        "MetricThreshold": "(tma_info_bad_spec_ipmispredict < 200)"
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
@@ -396,7 +396,7 @@
         "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + LSD.UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS)",
         "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_frontend_dsb_coverage",
-        "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_info_thread_ipc / 4 > 0.35",
+        "MetricThreshold": "(tma_info_frontend_dsb_coverage < 0.7) & ((tma_info_thread_ipc / 4) > 0.35)",
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
@@ -405,6 +405,12 @@
         "MetricGroup": "Fed",
         "MetricName": "tma_info_frontend_ipunknown_branch"
     },
+    {
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
     {
         "BriefDescription": "Branch instructions per taken branch.",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
@@ -423,7 +429,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR)",
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
-        "MetricThreshold": "tma_info_inst_mix_iparith < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith < 10)",
         "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
     },
     {
@@ -431,7 +437,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
-        "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_avx128 < 10)",
         "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -439,7 +445,7 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
-        "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_avx256 < 10)",
         "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -447,7 +453,7 @@
         "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
-        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_scalar_dp < 10)",
         "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -455,7 +461,7 @@
         "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
-        "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
+        "MetricThreshold": "(tma_info_inst_mix_iparith_scalar_sp < 10)",
         "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
     },
     {
@@ -463,42 +469,42 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES",
         "MetricGroup": "Branches;Fed;InsType",
         "MetricName": "tma_info_inst_mix_ipbranch",
-        "MetricThreshold": "tma_info_inst_mix_ipbranch < 8"
+        "MetricThreshold": "(tma_info_inst_mix_ipbranch < 8)"
     },
     {
         "BriefDescription": "Instructions per (near) call (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_ipcall",
-        "MetricThreshold": "tma_info_inst_mix_ipcall < 200"
+        "MetricThreshold": "(tma_info_inst_mix_ipcall < 200)"
     },
     {
         "BriefDescription": "Instructions per Floating Point (FP) Operation (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)",
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_ipflop",
-        "MetricThreshold": "tma_info_inst_mix_ipflop < 10"
+        "MetricThreshold": "(tma_info_inst_mix_ipflop < 10)"
     },
     {
         "BriefDescription": "Instructions per Load (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_LOADS",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipload",
-        "MetricThreshold": "tma_info_inst_mix_ipload < 3"
+        "MetricThreshold": "(tma_info_inst_mix_ipload < 3)"
     },
     {
         "BriefDescription": "Instructions per Store (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / MEM_UOPS_RETIRED.ALL_STORES",
         "MetricGroup": "InsType",
         "MetricName": "tma_info_inst_mix_ipstore",
-        "MetricThreshold": "tma_info_inst_mix_ipstore < 8"
+        "MetricThreshold": "(tma_info_inst_mix_ipstore < 8)"
     },
     {
         "BriefDescription": "Instructions per taken branch",
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 9",
+        "MetricThreshold": "(tma_info_inst_mix_iptb < 4 * 2 + 1)",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_lcp"
     },
     {
@@ -521,7 +527,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
@@ -533,7 +539,7 @@
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
@@ -575,7 +581,7 @@
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l3_cache_fill_bw"
     },
@@ -594,7 +600,7 @@
     {
         "BriefDescription": "Average Latency for L2 cache miss demand Loads",
         "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
-        "MetricGroup": "Memory_Lat;Offcore",
+        "MetricGroup": "LockCont;Memory_Lat;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_miss_latency"
     },
     {
@@ -623,10 +629,10 @@
         "MetricExpr": "(cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_LOAD_MISSES.WALK_DURATION\\,cmask\\=1@ + cpu@DTLB_STORE_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * (DTLB_STORE_MISSES.WALK_COMPLETED + DTLB_LOAD_MISSES.WALK_COMPLETED + ITLB_MISSES.WALK_COMPLETED)) / tma_info_core_core_clks",
         "MetricGroup": "Mem;MemoryTLB",
         "MetricName": "tma_info_memory_tlb_page_walks_utilization",
-        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5"
+        "MetricThreshold": "(tma_info_memory_tlb_page_walks_utilization > 0.5)"
     },
     {
-        "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per core",
+        "BriefDescription": "",
         "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)",
         "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
         "MetricName": "tma_info_pipeline_execute"
@@ -639,7 +645,7 @@
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -657,14 +663,14 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time",
+        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / tma_info_system_time",
         "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
         "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
@@ -674,7 +680,7 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "(tma_info_system_ipfarbranch < 1000000)"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
@@ -687,7 +693,20 @@
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "OS",
         "MetricName": "tma_info_system_kernel_utilization",
-        "MetricThreshold": "tma_info_system_kernel_utilization > 0.05"
+        "MetricThreshold": "(tma_info_system_kernel_utilization > 0.05)"
+    },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "((tma_info_system_mux > 1.1)|(tma_info_system_mux < 0.9))"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
     },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
@@ -701,6 +720,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "(tma_info_system_time < 1)"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -743,31 +769,31 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / INST_RETIRED.ANY",
         "MetricGroup": "Pipeline;Ret;Retire",
         "MetricName": "tma_info_thread_uoppi",
-        "MetricThreshold": "tma_info_thread_uoppi > 1.05"
+        "MetricThreshold": "(tma_info_thread_uoppi > 1.05)"
     },
     {
         "BriefDescription": "Uops per taken branch",
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 6"
+        "MetricThreshold": "(tma_info_thread_uptb < 4 * 1.5)"
    },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_DURATION\\,cmask\\=1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_itlb_misses > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "(tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
@@ -775,8 +801,8 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "(tma_l2_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -785,7 +811,7 @@
         "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_MISS / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "(tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2)))",
         "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
         "ScaleUnit": "100%"
     },
@@ -795,7 +821,7 @@
         "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_l3_hit_latency > 0.1) & ((tma_l3_bound > 0.05) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency",
         "ScaleUnit": "100%"
     },
@@ -804,7 +830,7 @@
         "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_lcp > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
@@ -813,7 +839,7 @@
         "MetricExpr": "tma_retiring - tma_heavy_operations",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
-        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricThreshold": "(tma_light_operations > 0.6)",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
@@ -824,7 +850,7 @@
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_2 + UOPS_DISPATCHED_PORT.PORT_3 + UOPS_DISPATCHED_PORT.PORT_7 - UOPS_DISPATCHED_PORT.PORT_4) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_load_op_utilization",
-        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "MetricThreshold": "(tma_load_op_utilization > 0.6)",
         "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. Sample with: UOPS_DISPATCHED.PORT_2_3_10",
         "ScaleUnit": "100%"
     },
@@ -832,9 +858,9 @@
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_lock_latency > 0.2) & ((tma_l1_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -844,7 +870,7 @@
         "MetricExpr": "tma_bad_speculation - tma_branch_mispredicts",
         "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
         "MetricName": "tma_machine_clears",
-        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricThreshold": "(tma_machine_clears > 0.1) & ((tma_bad_speculation > 0.15))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
         "ScaleUnit": "100%"
@@ -852,9 +878,9 @@
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_mem_bandwidth > 0.2) & ((tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
@@ -863,7 +889,7 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "(tma_mem_latency > 0.1) & ((tma_dram_bound > 0.1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))",
         "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
@@ -873,7 +899,7 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) / (CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ipc > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound",
         "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_memory_bound",
-        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricThreshold": "(tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))",
         "MetricgroupNoGroup": "TopdownL2",
         "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
         "ScaleUnit": "100%"
@@ -883,7 +909,7 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * IDQ.MS_UOPS / tma_info_thread_slots",
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
-        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "MetricThreshold": "(tma_microcode_sequencer > 0.05) & ((tma_heavy_operations > 0.1))",
         "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -892,7 +918,7 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "(tma_mispredicts_resteers > 0.05) & ((tma_branch_resteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15))))",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
         "ScaleUnit": "100%"
     },
@@ -901,7 +927,7 @@
         "MetricExpr": "(IDQ.ALL_MITE_CYCLES_ANY_UOPS - IDQ.ALL_MITE_CYCLES_4_UOPS) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_mite",
-        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "MetricThreshold": "(tma_mite > 0.1) & ((tma_fetch_bandwidth > 0.2))",
         "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
         "ScaleUnit": "100%"
     },
@@ -910,7 +936,7 @@
         "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "(tma_ms_switches > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)))",
         "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
         "ScaleUnit": "100%"
     },
@@ -919,8 +945,8 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_0 / tma_info_core_core_clks",
         "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_0",
-        "MetricThreshold": "tma_port_0 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_port_0 > 0.6)",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -928,8 +954,8 @@
         "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_1 / tma_info_core_core_clks",
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_1",
-        "MetricThreshold": "tma_port_1 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "(tma_port_1 > 0.6)",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -937,7 +963,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_2 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_gro= up", "MetricName": "tma_port_2", - "MetricThreshold": "tma_port_2 > 0.6", + "MetricThreshold": "(tma_port_2 > 0.6)", "ScaleUnit": "100%" }, { @@ -945,7 +971,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_3 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL6;tma_L6_group;tma_load_op_utilization_gro= up", "MetricName": "tma_port_3", - "MetricThreshold": "tma_port_3 > 0.6", + "MetricThreshold": "(tma_port_3 > 0.6)", "ScaleUnit": "100%" }, { @@ -953,7 +979,7 @@ "MetricExpr": "tma_store_op_utilization", "MetricGroup": "TopdownL6;tma_L6_group;tma_issueSpSt;tma_store_op_= utilization_group", "MetricName": "tma_port_4", - "MetricThreshold": "tma_port_4 > 0.6", + "MetricThreshold": "(tma_port_4 > 0.6)", "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 4 (Store-data). Related metrics: t= ma_split_stores", "ScaleUnit": "100%" }, @@ -962,7 +988,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_5 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", - "MetricThreshold": "tma_port_5 > 0.6", + "MetricThreshold": "(tma_port_5 > 0.6)", "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, t= ma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_6, = tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -971,8 +997,8 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_6 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", - "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "MetricThreshold": "(tma_port_6 > 0.6)", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, = tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5,= tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -980,7 +1006,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_7 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL6;tma_L6_group;tma_store_op_utilization_gr= oup", "MetricName": "tma_port_7", - "MetricThreshold": "tma_port_7 > 0.6", + "MetricThreshold": "(tma_port_7 > 0.6)", "ScaleUnit": "100%" }, { @@ -989,7 +1015,7 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "(tma_ports_utilization > 0.15) & ((tma_core_bo= und > 0.1) & ((tma_backend_bound > 0.2)))", "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", "ScaleUnit": "100%" }, @@ -998,7 +1024,7 @@ "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_0 > 0.2) & ((tma_ports_uti= lization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", "ScaleUnit": "100%" }, @@ -1007,7 +1033,7 @@ "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_1 > 0.2) & ((tma_ports_uti= lization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). 
This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1016,7 +1042,7 @@ "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_2 > 0.15) & ((tma_ports_ut= ilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))= ", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1025,7 +1051,7 @@ "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_ports_utilized_3m > 0.4) & ((tma_ports_ut= ilization > 0.15) & ((tma_core_bound > 0.1) & ((tma_backend_bound > 0.2))))= ", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, @@ -1034,7 +1060,7 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / tma_info_thread_slots", "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricThreshold": "((tma_retiring > 0.7)|(tma_heavy_operations > = 0.1))", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. 
For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", "ScaleUnit": "100%" @@ -1045,7 +1071,7 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_split_loads > 0.3)", "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", "ScaleUnit": "100%" }, @@ -1054,16 +1080,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_split_stores > 0.2) & ((tma_store_bound >= 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_sq_full > 0.3) & ((tma_l3_bound > 0.05) &= ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1072,7 +1098,7 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", + "MetricThreshold": "(tma_store_bound > 0.2) & ((tma_memory_bound >= 0.2) & ((tma_backend_bound > 0.2)))", "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", "ScaleUnit": "100%" }, @@ -1081,7 +1107,7 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_store_fwd_blk > 0.1) & ((tma_l1_bound > 0= .1) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", "ScaleUnit": "100%" }, @@ -1089,9 +1115,9 @@ "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "(tma_store_latency > 0.1) & ((tma_store_bound = > 0.2) & ((tma_memory_bound > 0.2) & ((tma_backend_bound > 0.2))))", "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", "ScaleUnit": "100%" }, @@ -1100,7 +1126,7 @@ "MetricExpr": "UOPS_DISPATCHED_PORT.PORT_4 / tma_info_core_core_cl= ks", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_store_op_utilization", - "MetricThreshold": "tma_store_op_utilization > 0.6", + "MetricThreshold": "(tma_store_op_utilization > 0.6)", "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. 
Sample with:= UOPS_DISPATCHED.PORT_7_8", "ScaleUnit": "100%" }, @@ -1109,7 +1135,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "(tma_unknown_branches > 0.05) & ((tma_branch_r= esteers > 0.05) & ((tma_fetch_latency > 0.1) & ((tma_frontend_bound > 0.15)= )))", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -1118,7 +1144,7 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", + "MetricThreshold": "(tma_x87_use > 0.1) & ((tma_fp_arith > 0.2) & = ((tma_light_operations > 0.6)))", "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", "ScaleUnit": "100%" } diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/cache.json b/tools/= perf/pmu-events/arch/x86/broadwellde/cache.json index 315d7f041731..49d8de8f1b51 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/cache.json @@ -22,7 +22,7 @@ "Counter": "2", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", - "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch.\nNote: In the L1D, a Dema= nd Read contains cacheable or noncacheable demand loads, including ones cau= sing cache-line splits and reads due to page walks resulted from any reques= t type.", + "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch. 
Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page walks resulted from any request type.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -434,7 +434,7 @@
         "EventCode": "0xD1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.HIT_LFB",
         "PEBS": "1",
-        "PublicDescription": "This event counts retired load uops which data sources were load uops missed L1 but hit a fill buffer due to a preceding miss to the same cache line with the data not ready.\nNote: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load.",
+        "PublicDescription": "This event counts retired load uops which data sources were load uops missed L1 but hit a fill buffer due to a preceding miss to the same cache line with the data not ready. Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load.",
         "SampleAfterValue": "100003",
         "UMask": "0x40"
     },
@@ -445,7 +445,7 @@
         "EventCode": "0xD1",
         "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT",
         "PEBS": "1",
-        "PublicDescription": "This event counts retired load uops which data sources were hits in the nearest-level (L1) cache.\nNote: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load. This event also counts SW prefetches independent of the actual data source.",
+        "PublicDescription": "This event counts retired load uops which data sources were hits in the nearest-level (L1) cache. Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though the corresponding AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load. This event also counts SW prefetches independent of the actual data source.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -634,7 +634,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xb2",
         "EventName": "OFFCORE_REQUESTS_BUFFER.SQ_FULL",
-        "PublicDescription": "This event counts the number of cases when the offcore requests buffer cannot take more entries for the core. This can happen when the superqueue does not contain eligible entries, or when L1D writeback pending FIFO requests is full.\nNote: Writeback pending FIFO has six entries.",
+        "PublicDescription": "This event counts the number of cases when the offcore requests buffer cannot take more entries for the core. This can happen when the superqueue does not contain eligible entries, or when L1D writeback pending FIFO requests is full. Note: Writeback pending FIFO has six entries.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -697,7 +697,7 @@
         "Errata": "BDM76",
         "EventCode": "0x60",
         "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD",
-        "PublicDescription": "This event counts the number of offcore outstanding Demand Data Read transactions in the super queue (SQ) every cycle. A transaction is considered to be in the Offcore outstanding state between L2 miss and transaction completion sent to requestor. See the corresponding Umask under OFFCORE_REQUESTS.\nNote: A prefetch promoted to Demand is counted from the promotion point.",
+        "PublicDescription": "This event counts the number of offcore outstanding Demand Data Read transactions in the super queue (SQ) every cycle. A transaction is considered to be in the Offcore outstanding state between L2 miss and transaction completion sent to requestor. See the corresponding Umask under OFFCORE_REQUESTS. Note: A prefetch promoted to Demand is counted from the promotion point.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json b/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json
index db3488abf9fc..018020a51436 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/frontend.json
@@ -12,7 +12,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xAB",
         "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES",
-        "PublicDescription": "This event counts Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles. These cycles do not include uops routed through because of the switch itself, for example, when Instruction Decode Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (IDQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receiving the first MITE uop. \nMM is placed before Instruction Decode Queue (IDQ) to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode Stream Buffer (DSB)-to-MITE switch occurs.\nPenalty: A Decode Stream Buffer (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cycles in which no uops are delivered to the IDQ. Most often, such switches from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.",
+        "PublicDescription": "This event counts Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles. These cycles do not include uops routed through because of the switch itself, for example, when Instruction Decode Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (IDQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receiving the first MITE uop. MM is placed before Instruction Decode Queue (IDQ) to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode Stream Buffer (DSB)-to-MITE switch occurs. Penalty: A Decode Stream Buffer (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cycles in which no uops are delivered to the IDQ. Most often, such switches from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.",
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
     },
@@ -212,7 +212,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x9C",
         "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE",
-        "PublicDescription": "This event counts the number of uops not delivered to Resource Allocation Table (RAT) per thread adding 4 x when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3}). Counting does not cover cases when:\n a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread;\n b. Resource Allocation Table (RAT) is stalled for the thread (including uop drops and clear BE conditions); \n c. Instruction Decode Queue (IDQ) delivers four uops.",
+        "PublicDescription": "This event counts the number of uops not delivered to Resource Allocation Table (RAT) per thread adding 4 x when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3}). Counting does not cover cases when: a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread; b. Resource Allocation Table (RAT) is stalled for the thread (including uop drops and clear BE conditions); c. Instruction Decode Queue (IDQ) delivers four uops.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/memory.json b/tools/perf/pmu-events/arch/x86/broadwellde/memory.json
index 31a74eed2f7d..9fca9a4eeb48 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellde/memory.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/memory.json
@@ -68,7 +68,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xc8",
         "EventName": "HLE_RETIRED.START",
-        "PublicDescription": "Number of times we entered an HLE region\n does not count nested transactions.",
+        "PublicDescription": "Number of times we entered an HLE region does not count nested transactions.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -77,7 +77,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xC3",
         "EventName": "MACHINE_CLEARS.MEMORY_ORDERING",
-        "PublicDescription": "This event counts the number of memory ordering Machine Clears detected. Memory Ordering Machine Clears can result from one of the following:\n1. memory disambiguation,\n2. external snoop, or\n3. cross SMT-HW-thread snoop (stores) hitting load buffer.",
+        "PublicDescription": "This event counts the number of memory ordering Machine Clears detected. Memory Ordering Machine Clears can result from one of the following: 1. memory disambiguation, 2. external snoop, or 3. cross SMT-HW-thread snoop (stores) hitting load buffer.",
         "SampleAfterValue": "100003",
         "UMask": "0x2"
     },
@@ -280,7 +280,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.START",
-        "PublicDescription": "Number of times we entered an RTM region\n does not count nested transactions.",
+        "PublicDescription": "Number of times we entered an RTM region does not count nested transactions.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json b/tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json
index 4193c90c3459..0863375bdead 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/metricgroups.json
@@ -9,6 +9,7 @@
     "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -34,6 +35,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -51,6 +53,7 @@
     "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Power": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -78,6 +81,7 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -103,6 +107,7 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json b/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json
index c03f77539362..962cd07eb307 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/pipeline.json
@@ -379,7 +379,7 @@
         "BriefDescription": "Reference cycles when the core is not in halt state.",
         "Counter": "Fixed counter 2",
         "EventName": "CPU_CLK_UNHALTED.REF_TSC",
-        "PublicDescription": "This event counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. \nNote: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. This event is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.",
+        "PublicDescription": "This event counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. Note: On all current platforms this event stops counting during 'throttling (TM)' states duty off periods the processor is 'halted'. This event is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is done at a lower clock rate then the core clock the overflow status bit for this counter may appear 'sticky'. After the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX. The reset value to the counter is not clocked immediately so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled) after which the reset value gets clocked into the counter. Therefore, software will get the interrupt, read the overflow status bit '1 for bit 34 while the counter value is less than MAX. Software should ignore this case.",
         "SampleAfterValue": "2000003",
         "UMask": "0x3"
     },
@@ -579,7 +579,7 @@
         "BriefDescription": "Instructions retired from execution.",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PublicDescription": "This event counts the number of instructions retired from execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. Counting continues during hardware interrupts, traps, and inside interrupt handlers. \nNotes: INST_RETIRED.ANY is counted by a designated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. INST_RETIRED.ANY_P is counted by a programmable counter and it is an architectural performance event. \nCounting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retired instructions.",
+        "PublicDescription": "This event counts the number of instructions retired from execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. INST_RETIRED.ANY_P is counted by a programmable counter and it is an architectural performance event. Counting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retired instructions.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -654,7 +654,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x03",
         "EventName": "LD_BLOCKS.STORE_FORWARD",
-        "PublicDescription": "This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when:\n - preceding store conflicts with the load (incomplete overlap);\n - store forwarding is impossible due to u-arch limitations;\n - preceding lock RMW operations are not forwarded;\n - store has the no-forward bit set (uncacheable/page-split/masked stores);\n - all-blocking stores are used (mostly, fences and port I/O);\nand others.\nThe most common case is a load blocked due to its address range overlapping with a preceding smaller uncompleted store. Note: This event does not take into account cases of out-of-SW-control (for example, SbTailHit), unknown physical STA, and cases of blocking loads on store due to being non-WB memory type or a lock. These cases are covered by other events.\nSee the table of not supported store forwards in the Optimization Guide.",
+        "PublicDescription": "This event counts how many times the load operation got the true Block-on-Store blocking code preventing store forwarding. This includes cases when: - preceding store conflicts with the load (incomplete overlap); - store forwarding is impossible due to u-arch limitations; - preceding lock RMW operations are not forwarded; - store has the no-forward bit set (uncacheable/page-split/masked stores); - all-blocking stores are used (mostly, fences and port I/O); and others. The most common case is a load blocked due to its address range overlapping with a preceding smaller uncompleted store. Note: This event does not take into account cases of out-of-SW-control (for example, SbTailHit), unknown physical STA, and cases of blocking loads on store due to being non-WB memory type or a lock. These cases are covered by other events. See the table of not supported store forwards in the Optimization Guide.",
         "SampleAfterValue": "100003",
         "UMask": "0x2"
     },
@@ -822,7 +822,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x5E",
         "EventName": "RS_EVENTS.EMPTY_CYCLES",
-        "PublicDescription": "This event counts cycles during which the reservation station (RS) is empty for the thread.\nNote: In ST-mode, not active thread should drive 0. This is usually caused by severely costly branch mispredictions, or allocator/FE issues.",
+        "PublicDescription": "This event counts cycles during which the reservation station (RS) is empty for the thread. Note: In ST-mode, not active thread should drive 0. This is usually caused by severely costly branch mispredictions, or allocator/FE issues.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
@@ -1177,7 +1177,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x0E",
         "EventName": "UOPS_ISSUED.FLAGS_MERGE",
-        "PublicDescription": "Number of flags-merge uops being allocated. Such uops considered perf sensitive\n added by GSR u-arch.",
+        "PublicDescription": "Number of flags-merge uops being allocated. Such uops considered perf sensitive added by GSR u-arch.",
         "SampleAfterValue": "2000003",
         "UMask": "0x10"
     },
diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json
index f5b5ae1150c3..bfc499fdfdeb 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-cache.json
@@ -608,7 +608,7 @@
         "EventCode": "0x12",
         "EventName": "UNC_C_RxR_EXT_STARVED.IPQ",
         "PerPkg": "1",
-        "PublicDescription": "Counts cycles in external starvation. This occurs when one of the ingress queues is being starved by the other queues.; IPQ is externally startved and therefore we are blocking the IRQ.",
+        "PublicDescription": "Counts cycles in external starvation. This occurs when one of the ingress queues is being starved by the other queues.; IPQ is externally starved and therefore we are blocking the IRQ.",
         "UMask": "0x2",
         "Unit": "CBOX"
     },
@@ -1654,7 +1654,7 @@
         "EventCode": "0x14",
         "EventName": "UNC_H_BYPASS_IMC.NOT_TAKEN",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of times when the HA was able to bypass was attempted. This is a latency optimization for situations when there is light loadings on the memory subsystem. This can be filted by when the bypass was taken and when it was not.; Filter for transactions that could not take the bypass.",
+        "PublicDescription": "Counts the number of times when the HA was able to bypass was attempted. This is a latency optimization for situations when there is light loadings on the memory subsystem. This can be filtered by when the bypass was taken and when it was not.; Filter for transactions that could not take the bypass.",
         "UMask": "0x2",
         "Unit": "HA"
     },
@@ -1664,7 +1664,7 @@
         "EventCode": "0x14",
         "EventName": "UNC_H_BYPASS_IMC.TAKEN",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of times when the HA was able to bypass was attempted. This is a latency optimization for situations when there is light loadings on the memory subsystem.
This can be filted by when the bypass was taken and when it was not.; Filter for transactions that succeeded in taking the bypass.",
+        "PublicDescription": "Counts the number of times when the HA was able to bypass was attempted. This is a latency optimization for situations when there is light loadings on the memory subsystem. This can be filtered by when the bypass was taken and when it was not.; Filter for transactions that succeeded in taking the bypass.",
         "UMask": "0x1",
         "Unit": "HA"
     },
@@ -2679,7 +2679,7 @@
         "EventCode": "0x15",
         "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN0",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high banwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 0 only.",
+        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high bandwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 0 only.",
         "UMask": "0x1",
         "Unit": "HA"
     },
@@ -2689,7 +2689,7 @@
         "EventCode": "0x15",
         "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN1",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high banwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 1 only.",
+        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high bandwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 1 only.",
         "UMask": "0x2",
         "Unit": "HA"
     },
@@ -2699,7 +2699,7 @@
         "EventCode": "0x15",
         "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN2",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high banwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 2 only.",
+        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high bandwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 2 only.",
         "UMask": "0x4",
         "Unit": "HA"
     },
@@ -2709,7 +2709,7 @@
         "EventCode": "0x15",
         "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN3",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high banwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 3 only.",
+        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting reads from the HA into the iMC. In order to send reads into the memory controller, the HA must first acquire a credit for the iMC's RPQ (read pending queue). This queue is broken into regular credits/buffers that are used by general reads, and special requests such as ISOCH reads. This count only tracks the regular credits Common high bandwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 3 only.",
         "UMask": "0x8",
         "Unit": "HA"
     },
@@ -3239,7 +3239,7 @@
         "EventCode": "0x3",
         "EventName": "UNC_H_TRACKER_CYCLES_NE.ALL",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when the local HA tracker pool is not empty. This can be used with edge detect to identify the number of situations when the pool became empty. This should not be confused with RTID credit usage -- which must be tracked inside each cbo individually -- but represents the actual tracker buffer structure. In other words, this buffer could be completely empty, but there may still be credits in use by the CBos. This stat can be used in conjunction with the occupancy accumulation stat in order to calculate average queue occpancy. HA trackers are allocated as soon as a request enters the HA if an HT (Home Tracker) entry is available and is released after the snoop response and data return (or post in the case of a write) and the response is returned on the ring.; Requests coming from both local and remote sockets.",
+        "PublicDescription": "Counts the number of cycles when the local HA tracker pool is not empty. This can be used with edge detect to identify the number of situations when the pool became empty. This should not be confused with RTID credit usage -- which must be tracked inside each cbo individually -- but represents the actual tracker buffer structure. In other words, this buffer could be completely empty, but there may still be credits in use by the CBos. This stat can be used in conjunction with the occupancy accumulation stat in order to calculate average queue occupancy. HA trackers are allocated as soon as a request enters the HA if an HT (Home Tracker) entry is available and is released after the snoop response and data return (or post in the case of a write) and the response is returned on the ring.; Requests coming from both local and remote sockets.",
         "UMask": "0x3",
         "Unit": "HA"
     },
@@ -3249,7 +3249,7 @@
         "EventCode": "0x3",
         "EventName": "UNC_H_TRACKER_CYCLES_NE.LOCAL",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when the local HA tracker pool is not empty. This can be used with edge detect to identify the number of situations when the pool became empty. This should not be confused with RTID credit usage -- which must be tracked inside each cbo individually -- but represents the actual tracker buffer structure. In other words, this buffer could be completely empty, but there may still be credits in use by the CBos. This stat can be used in conjunction with the occupancy accumulation stat in order to calculate average queue occpancy. HA trackers are allocated as soon as a request enters the HA if an HT (Home Tracker) entry is available and is released after the snoop response and data return (or post in the case of a write) and the response is returned on the ring.; This filter includes only requests coming from the local socket.",
+        "PublicDescription": "Counts the number of cycles when the local HA tracker pool is not empty. This can be used with edge detect to identify the number of situations when the pool became empty. This should not be confused with RTID credit usage -- which must be tracked inside each cbo individually -- but represents the actual tracker buffer structure. In other words, this buffer could be completely empty, but there may still be credits in use by the CBos. This stat can be used in conjunction with the occupancy accumulation stat in order to calculate average queue occupancy. HA trackers are allocated as soon as a request enters the HA if an HT (Home Tracker) entry is available and is released after the snoop response and data return (or post in the case of a write) and the response is returned on the ring.; This filter includes only requests coming from the local socket.",
         "UMask": "0x1",
         "Unit": "HA"
     },
@@ -3259,7 +3259,7 @@
         "EventCode": "0x3",
         "EventName": "UNC_H_TRACKER_CYCLES_NE.REMOTE",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when the local HA tracker pool is not empty. This can be used with edge detect to identify the number of situations when the pool became empty. This should not be confused with RTID credit usage -- which must be tracked inside each cbo individually -- but represents the actual tracker buffer structure. In other words, this buffer could be completely empty, but there may still be credits in use by the CBos. This stat can be used in conjunction with the occupancy accumulation stat in order to calculate average queue occpancy. HA trackers are allocated as soon as a request enters the HA if an HT (Home Tracker) entry is available and is released after the snoop response and data return (or post in the case of a write) and the response is returned on the ring.; This filter includes only requests coming from remote sockets.",
+        "PublicDescription": "Counts the number of cycles when the local HA tracker pool is not empty. This can be used with edge detect to identify the number of situations when the pool became empty. This should not be confused with RTID credit usage -- which must be tracked inside each cbo individually -- but represents the actual tracker buffer structure. In other words, this buffer could be completely empty, but there may still be credits in use by the CBos. This stat can be used in conjunction with the occupancy accumulation stat in order to calculate average queue occupancy. HA trackers are allocated as soon as a request enters the HA if an HT (Home Tracker) entry is available and is released after the snoop response and data return (or post in the case of a write) and the response is returned on the ring.; This filter includes only requests coming from remote sockets.",
         "UMask": "0x2",
         "Unit": "HA"
     },
@@ -3679,7 +3679,7 @@
         "EventCode": "0x18",
         "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN0",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting writes from the HA into the iMC. In order to send writes into the memory controller, the HA must first acquire a credit for the iMC's WPQ (write pending queue). This queue is broken into regular credits/buffers that are used by general writes, and special requests such as ISOCH writes. This count only tracks the regular credits Common high banwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 0 only.",
+        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting writes from the HA into the iMC. In order to send writes into the memory controller, the HA must first acquire a credit for the iMC's WPQ (write pending queue). This queue is broken into regular credits/buffers that are used by general writes, and special requests such as ISOCH writes. This count only tracks the regular credits Common high bandwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 0 only.",
         "UMask": "0x1",
         "Unit": "HA"
     },
@@ -3689,7 +3689,7 @@
         "EventCode": "0x18",
         "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN1",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting writes from the HA into the iMC. In order to send writes into the memory controller, the HA must first acquire a credit for the iMC's WPQ (write pending queue). This queue is broken into regular credits/buffers that are used by general writes, and special requests such as ISOCH writes. This count only tracks the regular credits Common high banwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 1 only.",
+        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting writes from the HA into the iMC. In order to send writes into the memory controller, the HA must first acquire a credit for the iMC's WPQ (write pending queue). This queue is broken into regular credits/buffers that are used by general writes, and special requests such as ISOCH writes. This count only tracks the regular credits Common high bandwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 1 only.",
         "UMask": "0x2",
         "Unit": "HA"
     },
@@ -3699,7 +3699,7 @@
         "EventCode": "0x18",
         "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN2",
         "PerPkg": "1",
-        "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -3709,7 +3709,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. 
One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect= .json b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect.json index 58031f397168..5ccc49cb66a6 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/broadwellde/uncore-interconnect.json @@ -33,7 +33,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CLFLUSH", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x80", "Unit": "IRP" }, @@ -43,7 +43,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x2", "Unit": "IRP" }, @@ -53,7 +53,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.DRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x4", "Unit": "IRP" }, @@ -63,7 +63,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIDCAHINT", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x20", "Unit": "IRP" }, @@ -73,7 +73,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIRDCUR", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x1", "Unit": "IRP" }, @@ -83,7 +83,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCITOM", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x10", "Unit": "IRP" }, @@ -93,7 +93,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.RFO", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x8", "Unit": "IRP" }, @@ -103,7 +103,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.WBMTOI", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x40", "Unit": "IRP" }, diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 18f3fa1f1dab..e4f4d727ca04 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -4,7 +4,7 @@ GenuineIntel-6-BE,v1.28,alderlaken,core GenuineIntel-6-C[56],v1.06,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v30,broadwell,core -GenuineIntel-6-56,v11,broadwellde,core 
+GenuineIntel-6-56,v12,broadwellde,core
 GenuineIntel-6-4F,v22,broadwellx,core
 GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core
 GenuineIntel-6-9[6C],v1.05,elkhartlake,core
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:38 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-7-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 06/23] perf vendor events: Update BroadwellX events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v22 to v23. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v23:
https://github.com/intel/perfmon/commit/679982113f4bfa16cee19d5408a7f8e309e3ac23

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/broadwellx/bdx-metrics.json      | 344 ++++++++++--------
 .../pmu-events/arch/x86/broadwellx/cache.json |  10 +-
 .../arch/x86/broadwellx/frontend.json         |   4 +-
 .../arch/x86/broadwellx/memory.json           |   6 +-
 .../arch/x86/broadwellx/metricgroups.json     |   5 +
 .../arch/x86/broadwellx/pipeline.json         |  10 +-
 .../arch/x86/broadwellx/uncore-cache.json     |  28 +-
 .../x86/broadwellx/uncore-interconnect.json   |  36 +-
 .../arch/x86/broadwellx/uncore-memory.json    |   1 +
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 10 files changed, 230 insertions(+), 216 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
index 0577d7460082..8016202bad1f 100644
--- a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json
@@ -55,7 +55,7 @@
         "MetricName": "UNCORE_FREQ"
     },
     {
-        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.",
+        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY",
         "MetricName": "cpi",
         "ScaleUnit": "1per_instr"
     },
@@ -76,24 +76,24 @@
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions",
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
"MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D0x19= e@ * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D0x1= c8\\,filter_tid\\=3D0x3e@ + cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D= 0x180\\,filter_tid\\=3D0x3e@) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" @@ -102,14 +102,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -209,13 +209,13 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_o= pc\\=3D0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=3D= 0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=3D0x182@)= ", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_= opc\\=3D0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\= =3D0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=3D0x18= 2@)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -276,12 +276,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -294,8 +294,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", "ScaleUnit": "100%" }, { @@ -306,7 +306,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -316,7 +316,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { @@ -327,7 +327,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -335,8 +335,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -345,8 +345,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -354,7 +354,7 @@ "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MI= SP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. 
Rel= ated metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches", "ScaleUnit": "100%" }, @@ -362,10 +362,10 @@ "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_D= RAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RET= IRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + ME= M_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS= _RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_= HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_U= OPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM = + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED= .REMOTE_FWD)))) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= , tma_remote_cache", "ScaleUnit": "100%" }, { @@ -376,7 +376,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { @@ -385,8 +385,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRA= M + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIR= ED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -394,8 +394,8 @@ "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.FPU_DIV_ACTIVE", "ScaleUnit": "100%" }, { @@ -404,8 +404,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", "ScaleUnit": "100%" }, { @@ -414,7 +414,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -422,45 +422,45 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. 
Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tm= a_info_thread_clks", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / = tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) /= tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED)= / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. 
Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_= HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info= _thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_= RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE, OFFCORE_RESPONSE.DEMAND_RFO.LL= C_MISS.REMOTE_HITM. 
Related metrics: tma_contested_accesses, tma_data_shari= ng, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", "ScaleUnit": "100%" }, { @@ -489,7 +489,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -497,8 +497,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. 
May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", "ScaleUnit": "100%" }, { @@ -506,8 +506,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -515,8 +515,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -524,8 +524,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -535,33 +535,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. 
([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -572,7 +572,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -593,11 +593,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -616,7 +616,13 @@ "MetricName": "tma_info_frontend_ipunknown_branch" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -634,7 +640,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -642,7 +648,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -650,7 +656,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -658,7 +664,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -666,7 +672,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -708,7 +714,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -731,7 +737,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -743,7 +749,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -785,7 +791,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -804,7 +810,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -837,19 +843,19 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D0x1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": 
"UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -867,14 +873,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -884,13 +890,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -901,18 +908,31 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x18= 2@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x182\\,thresh\\=3D1@", + "MetricExpr": "cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\= =3D0x182@ / cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=3D0x182@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filte= r_opc\\=3D0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=3D0x18= 2@) / (tma_info_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -925,6 +945,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -933,12 +960,12 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -947,14 +974,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -980,24 +1008,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_= clks", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_threa= d_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. 
Related metri= cs: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_m= s_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -1005,8 +1033,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.ST= ALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1015,8 +1043,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_M= ISS / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -1025,8 +1053,8 @@ "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_= MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_L= OAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _FWD))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -1034,18 +1062,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. 
While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1063,18 +1091,18 @@ "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (= 1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOA= D_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UO= PS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_= LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_R= ETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM_PS", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -1090,10 +1118,10 @@ }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -1102,7 +1130,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, @@ -1114,7 +1142,7 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { @@ -1131,8 +1159,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers = / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts", "ScaleUnit": "100%" }, { @@ -1141,7 +1169,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", "ScaleUnit": "100%" }, { @@ -1149,8 +1177,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. 
Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -1159,7 +1187,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1168,7 +1196,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", "ScaleUnit": "100%" }, { @@ -1204,7 +1232,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. 
Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1213,7 +1241,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1231,43 +1259,43 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCL= ES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_c= lks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_= clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1276,8 +1304,8 @@ "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMO= TE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS= _RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_M= ISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Rel= ated metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, = tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM= , MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_contested_= accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -1285,8 +1313,8 @@ "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1305,8 +1333,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1314,16 +1342,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1332,8 +1360,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1341,18 +1369,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1368,7 +1396,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1377,8 +1405,8 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/cache.json b/tools/p= erf/pmu-events/arch/x86/broadwellx/cache.json index beeda41b428a..59bc8c71487e 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/cache.json @@ -22,7 +22,7 @@ "Counter": "2", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", - "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch.\nNote: In the L1D, a Dema= nd Read contains cacheable or noncacheable demand loads, including ones cau= sing cache-line splits and reads due to page walks resulted from any reques= t type.", + "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch. Note: In the L1D, a Deman= d Read contains cacheable or noncacheable demand loads, including ones caus= ing cache-line splits and reads due to page walks resulted from any request= type.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -434,7 +434,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.HIT_LFB", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready.\nNote: Only two data= -sources of L1/FB are applicable for AVX-256bit even though the correspond= ing AVX load could be serviced by a deeper level in the memory hierarchy. 
D= ata source is reported for the Low-half load.", + "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready. Note: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load.", "SampleAfterValue": "100003", "UMask": "0x40" }, @@ -445,7 +445,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache.\nNote: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load. This event also counts SW pref= etches independent of the actual data source.", + "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache. Note: Only two data-s= ources of L1/FB are applicable for AVX-256bit even though the correspondin= g AVX load could be serviced by a deeper level in the memory hierarchy. Dat= a source is reported for the Low-half load. This event also counts SW prefe= tches independent of the actual data source.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -634,7 +634,7 @@ "Counter": "0,1,2,3", "EventCode": "0xb2", "EventName": "OFFCORE_REQUESTS_BUFFER.SQ_FULL", - "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full.\nNote: Writeback pending FIFO has s= ix entries.", + "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full. Note: Writeback pending FIFO has si= x entries.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -697,7 +697,7 @@ "Errata": "BDM76", "EventCode": "0x60", "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", - "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS.\nNote: A prefetch promoted to Demand is coun= ted from the promotion point.", + "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS. 
Note: A prefetch promoted to Demand is count= ed from the promotion point.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json b/tool= s/perf/pmu-events/arch/x86/broadwellx/frontend.json index db3488abf9fc..018020a51436 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json @@ -12,7 +12,7 @@ "Counter": "0,1,2,3", "EventCode": "0xAB", "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", - "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. \nMM is placed before Instruction Decode Queue (IDQ) = to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths.= Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode S= tream Buffer (DSB)-to-MITE switch occurs.\nPenalty: A Decode Stream Buffer = (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six = cycles in which no uops are delivered to the IDQ. Most often, such switches= from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.= ", + "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. MM is placed before Instruction Decode Queue (IDQ) t= o merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. = Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode St= ream Buffer (DSB)-to-MITE switch occurs. Penalty: A Decode Stream Buffer (D= SB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cy= cles in which no uops are delivered to the IDQ. Most often, such switches f= rom the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -212,7 +212,7 @@ "Counter": "0,1,2,3", "EventCode": "0x9C", "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE", - "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when:\n a. IDQ-Resource Allocation = Table (RAT) pipe serves the other thread;\n b. Resource Allocation Table (R= AT) is stalled for the thread (including uop drops and clear BE conditions)= ; \n c. 
Instruction Decode Queue (IDQ) delivers four uops.", + "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when: a. IDQ-Resource Allocation T= able (RAT) pipe serves the other thread; b. Resource Allocation Table (RAT= ) is stalled for the thread (including uop drops and clear BE conditions); = c. Instruction Decode Queue (IDQ) delivers four uops.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/memory.json b/tools/= perf/pmu-events/arch/x86/broadwellx/memory.json index 86246f632d79..093c8b564fc0 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/memory.json @@ -68,7 +68,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc8", "EventName": "HLE_RETIRED.START", - "PublicDescription": "Number of times we entered an HLE region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an HLE region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -77,7 +77,7 @@ "Counter": "0,1,2,3", "EventCode": "0xC3", "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", - "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following:\n1. memory disambiguation,\n2. external snoop, or\n3= . cross SMT-HW-thread snoop (stores) hitting load buffer.", + "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following: 1. memory disambiguation, 2. external snoop, or 3. 
c= ross SMT-HW-thread snoop (stores) hitting load buffer.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -470,7 +470,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.START", - "PublicDescription": "Number of times we entered an RTM region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an RTM region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json b/= tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", 
"tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json b/tool= s/perf/pmu-events/arch/x86/broadwellx/pipeline.json index c03f77539362..962cd07eb307 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json @@ -379,7 +379,7 @@ "BriefDescription": "Reference cycles when the core is not in halt= state.", "Counter": "Fixed counter 2", "EventName": "CPU_CLK_UNHALTED.REF_TSC", - "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. \nNote: On all current platforms this event stops counting during 'thr= ottling (TM)' states duty off periods the processor is 'halted'. This even= t is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is= done at a lower clock rate then the core clock the overflow status bit for= this counter may appear 'sticky'. After the counter has overflowed and so= ftware clears the overflow status bit and resets the counter to less than M= AX. The reset value to the counter is not clocked immediately so the overfl= ow status bit will flip 'high (1)' and generate another PMI (if enabled) af= ter which the reset value gets clocked into the counter. Therefore, softwar= e will get the interrupt, read the overflow status bit '1 for bit 34 while = the counter value is less than MAX. Software should ignore this case.", + "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. Note: On all current platforms this event stops counting during 'thro= ttling (TM)' states duty off periods the processor is 'halted'. This event= is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. 
Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", "SampleAfterValue": "2000003", "UMask": "0x3" }, @@ -579,7 +579,7 @@ "BriefDescription": "Instructions retired from execution.", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. \nNotes: INST_RETIRED.ANY is counted by a designated fixed coun= ter, leaving the four (eight when Hyperthreading is disabled) programmable = counters available for other events. INST_RETIRED.ANY_P is counted by a pro= grammable counter and it is an architectural performance event. \nCounting:= Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as ret= ired instructions.", + "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed count= er, leaving the four (eight when Hyperthreading is disabled) programmable c= ounters available for other events. INST_RETIRED.ANY_P is counted by a prog= rammable counter and it is an architectural performance event. Counting: F= aulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retir= ed instructions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -654,7 +654,7 @@ "Counter": "0,1,2,3", "EventCode": "0x03", "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when:\n - preceding store conflicts with the load (= incomplete overlap);\n - store forwarding is impossible due to u-arch limit= ations;\n - preceding lock RMW operations are not forwarded;\n - store has = the no-forward bit set (uncacheable/page-split/masked stores);\n - all-bloc= king stores are used (mostly, fences and port I/O);\nand others.\nThe most = common case is a load blocked due to its address range overlapping with a p= receding smaller uncompleted store. Note: This event does not take into acc= ount cases of out-of-SW-control (for example, SbTailHit), unknown physical = STA, and cases of blocking loads on store due to being non-WB memory type o= r a lock. These cases are covered by other events.\nSee the table of not su= pported store forwards in the Optimization Guide.", + "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when: - preceding store conflicts with the load (i= ncomplete overlap); - store forwarding is impossible due to u-arch limitat= ions; - preceding lock RMW operations are not forwarded; - store has the = no-forward bit set (uncacheable/page-split/masked stores); - all-blocking = stores are used (mostly, fences and port I/O); and others. The most common = case is a load blocked due to its address range overlapping with a precedin= g smaller uncompleted store. 
Note: This event does not take into account ca= ses of out-of-SW-control (for example, SbTailHit), unknown physical STA, an= d cases of blocking loads on store due to being non-WB memory type or a loc= k. These cases are covered by other events. See the table of not supported = store forwards in the Optimization Guide.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -822,7 +822,7 @@ "Counter": "0,1,2,3", "EventCode": "0x5E", "EventName": "RS_EVENTS.EMPTY_CYCLES", - "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread.\nNote: In ST-mode, not acti= ve thread should drive 0. This is usually caused by severely costly branch = mispredictions, or allocator/FE issues.", + "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread. Note: In ST-mode, not activ= e thread should drive 0. This is usually caused by severely costly branch m= ispredictions, or allocator/FE issues.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -1177,7 +1177,7 @@ "Counter": "0,1,2,3", "EventCode": "0x0E", "EventName": "UOPS_ISSUED.FLAGS_MERGE", - "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive\n added by GSR u-arch.", + "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive added by GSR u-arch.", "SampleAfterValue": "2000003", "UMask": "0x10" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json b/= tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json index b55b305aecaa..8f6955615bfd 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json @@ -802,7 +802,7 @@ "EventCode": "0x12", "EventName": "UNC_C_RxR_EXT_STARVED.IPQ", "PerPkg": "1", - "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally startved and therefore we are blocking the IRQ.", + "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally starved and therefore we are blocking the IRQ.", "UMask": "0x2", "Unit": "CBOX" }, @@ -1859,7 +1859,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.NOT_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that could not take the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that could not take the bypass.", "UMask": "0x2", "Unit": "HA" }, @@ -1869,7 +1869,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. 
This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that succeeded in taking the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that succeeded in taking the bypass.", "UMask": "0x1", "Unit": "HA" }, @@ -2884,7 +2884,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -2894,7 +2894,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -2904,7 +2904,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -2914,7 +2914,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, @@ -3448,7 +3448,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; Requests coming from both local and remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; Requests coming from both local and remote sockets.", "UMask": "0x3", "Unit": "HA" }, @@ -3458,7 +3458,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. 
HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from the local socket.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from the local socket.", "UMask": "0x1", "Unit": "HA" }, @@ -3468,7 +3468,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from remote sockets.", "UMask": "0x2", "Unit": "HA" }, @@ -3888,7 +3888,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -3898,7 +3898,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -3908,7 +3908,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -3918,7 +3918,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. 
One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.= json b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json index 765d44012bba..56d729c8e946 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json @@ -1,24 +1,4 @@ [ - { - "BriefDescription": "Number of non data (control) flits transmitte= d . Derived from unc_q_txl_flits_g0.non_data", - "Counter": "0,1,2,3", - "EventName": "QPI_CTL_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL= non-data flits transmitted across QPI. This basically tracks the protocol= overhead on the QPI link. One can get a good picture of the QPI-link char= acteristics by evaluating the protocol flits, data flits, and idle/null fli= ts. This includes the header flits for data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x4", - "Unit": "QPI" - }, - { - "BriefDescription": "Number of data flits transmitted . Derived fr= om unc_q_txl_flits_g0.data", - "Counter": "0,1,2,3", - "EventName": "QPI_DATA_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data fli= ts transmitted over QPI. Each flit contains 64b of data. 
This includes bo= th DRS and NCB data flits (coherent and non-coherent). This can be used to= calculate the data bandwidth of the QPI link. One can get a good picture = of the QPI-link characteristics by evaluating the protocol flits, data flit= s, and idle/null flits. This does not include the header flits that go in = data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x2", - "Unit": "QPI" - }, { "BriefDescription": "Total Write Cache Occupancy; Any Source", "Counter": "0,1", @@ -53,7 +33,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CLFLUSH", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x80", "Unit": "IRP" }, @@ -63,7 +43,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x2", "Unit": "IRP" }, @@ -73,7 +53,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.DRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x4", "Unit": "IRP" }, @@ -83,7 +63,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIDCAHINT", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x20", "Unit": "IRP" }, @@ -93,7 +73,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIRDCUR", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x1", "Unit": "IRP" }, @@ -103,7 +83,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCITOM", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x10", "Unit": "IRP" }, @@ -113,7 +93,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.RFO", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x8", "Unit": "IRP" }, @@ -123,7 +103,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.WBMTOI", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x40", "Unit": "IRP" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json b= /tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json index 45555316f8ea..ca09c1286485 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json @@ -184,6 +184,7 @@ { "BriefDescription": "This event is deprecated. 
Refer to new event = UNC_M_CLOCKTICKS_P", "Counter": "0,1,2,3", + "Deprecated": "1", "EventName": "UNC_M_DCLOCKTICKS", "PerPkg": "1", "Unit": "iMC" diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index e4f4d727ca04..20a1b8ad38be 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -5,7 +5,7 @@ GenuineIntel-6-C[56],v1.06,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v30,broadwell,core GenuineIntel-6-56,v12,broadwellde,core -GenuineIntel-6-4F,v22,broadwellx,core +GenuineIntel-6-4F,v23,broadwellx,core GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core GenuineIntel-6-9[6C],v1.05,elkhartlake,core GenuineIntel-6-CF,v1.09,emeraldrapids,core --=20 2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:39 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-8-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 07/23] perf vendor events: Update CascadelakeX events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.22 to v1.23. Update TMA metrics from 4.8 to 5.01.
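As a point of reference (illustration only, not part of the patch): a
MetricExpr is an arithmetic formula over event counts that perf evaluates
when the metric is requested; the real evaluator lives in
tools/perf/util/expr.c. A minimal Python sketch of how the updated "cpi"
entry ("CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY") maps counts to a
value, using made-up counts:

    # Minimal sketch, not perf's evaluator (see tools/perf/util/expr.c).
    # Event names come from the "cpi" metric in clx-metrics.json; the
    # counts below are hypothetical sample values.
    counts = {
        "CPU_CLK_UNHALTED.THREAD": 1_200_000_000,
        "INST_RETIRED.ANY": 1_000_000_000,
    }

    # MetricExpr: "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY"
    # ScaleUnit: "1per_instr" (cycles per retired instruction)
    cpi = counts["CPU_CLK_UNHALTED.THREAD"] / counts["INST_RETIRED.ANY"]
    print(f"cpi = {cpi:.2f} per instruction")  # -> cpi = 1.20

At runtime the named metric can be requested directly, e.g.
"perf stat -M cpi -a -- sleep 1", which programs the constituent events
and prints the scaled result.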
Bring in the event updates v1.23: https://github.com/intel/perfmon/commit/8f3665f6be4688fd1dd1e713ba49ca16ec9= 3b856 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-developed-by: Caleb Biggers Signed-off-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/cascadelakex/clx-metrics.json | 767 ++++++++++-------- .../arch/x86/cascadelakex/metricgroups.json | 9 +- .../arch/x86/cascadelakex/uncore-cache.json | 60 +- .../x86/cascadelakex/uncore-interconnect.json | 11 - tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 5 files changed, 464 insertions(+), 385 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json b= /tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json index b02a89e14c5d..5729b93a9c68 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/clx-metrics.json @@ -55,7 +55,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -76,31 +76,31 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", "MetricName": "dtlb_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. 
This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART0 + UNC_IIO= _PAYLOAD_BYTES_IN.MEM_WRITE.PART1 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART= 2 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART3) * 4 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" @@ -109,14 +109,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -192,25 +192,25 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", "MetricName": "llc_miss_local_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_local_memory_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_remote_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", "MetricName": "llc_miss_remote_memory_bandwidth_write", "ScaleUnit": "1MB/s" @@ -240,13 +240,13 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x404= 32@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x40432@ + cha@UNC_CHA= _TOR_INSERTS.IA_MISS\\,config1\\=3D0x40431@)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x404= 31@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x40432@ + cha@UNC_CHA= _TOR_INSERTS.IA_MISS\\,config1\\=3D0x40431@)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -313,12 +313,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= 
ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -330,7 +330,7 @@ "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info= _thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, @@ -341,7 +341,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. 
Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -351,9 +351,102 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_= store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores += tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. 
Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_b= ranches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tm= a_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers = + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_i= tlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. 
FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 = - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", "MetricConstraint": "NO_GROUP_EVENTS", @@ -362,7 +455,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredic= tions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost= , tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -370,8 +463,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. 
For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -379,8 +472,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -388,18 +481,50 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.= DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER= _CORE_FWD))) + 44 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRE= D.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)= / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((47.5 * tma_info_system_core_frequency - 3.5 * tma= _info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DE= MAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER= _CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) + (47.5 * tma_info_= system_core_frequency - 3.5 * tma_info_system_core_frequency) * 
MEM_LOAD_L3= _HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L= 1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -410,25 +535,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "44 * tma_info_system_core_frequency * (MEM_LOAD_L3_= HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_= DATA_RD.L3_HIT.HITM_OTHER_CORE / (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE= + OCR.DEMAND_DATA_RD.L3_HIT.HIT_OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(47.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE /= (OCR.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OCR.DEMAND_DATA_RD.L3_HIT.HIT= _OTHER_CORE_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. 
Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -437,18 +562,18 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma= _info_thread_clks - tma_l2_bound - tma_pmm_bound if #has_pmem > 0 else CYCL= E_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L= 1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bo= und)", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -457,7 +582,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -465,47 +590,47 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(110 * tma_info_system_core_frequency * (OCR.DEMAND_RFO.L3_MISS.REMOTE_HITM + OCR.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_info_system_core_frequency * (OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OCR.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, OCR.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -515,7 +640,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -525,17 +650,17 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
@@ -545,7 +670,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -554,7 +679,7 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
         "ScaleUnit": "100%"
     },
     {
@@ -562,17 +687,17 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / UOPS_RETIRED.RETIRE_SLOTS",
+        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xFC@ / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -581,8 +706,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -590,8 +715,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -599,7 +724,7 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -610,50 +735,50 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
         "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fused_instructions",
         "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
         "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / tma_info_thread_slots",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
-        "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / tma_info_thread_clks",
+        "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@) / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 4 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
@@ -663,7 +788,7 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
@@ -678,8 +803,8 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
@@ -687,7 +812,7 @@
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -695,108 +820,10 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
-        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)",
-        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
-        "MetricName": "tma_info_bottleneck_big_code",
-        "MetricThreshold": "tma_info_bottleneck_big_code > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
-        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
-        "MetricGroup": "BvBO;Ret",
-        "MetricName": "tma_info_bottleneck_branching_overhead",
-        "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5",
-        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
-        "MetricName": "tma_info_bottleneck_cache_memory_bandwidth",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
-        "MetricName": "tma_info_bottleneck_cache_memory_latency",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
-    },
-    {
-        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
-        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializing_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
-        "MetricGroup": "BvCB;Cor;tma_issueComp",
-        "MetricName": "tma_info_bottleneck_compute_bound_est",
-        "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20",
-        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code",
-        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
-        "MetricName": "tma_info_bottleneck_instruction_fetch_bw",
-        "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
-        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
-        "MetricName": "tma_info_bottleneck_irregular_overhead",
-        "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10",
-        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency)))",
-        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_data_tlbs",
-        "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20",
-        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_synchronization"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
-        "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB",
-        "MetricName": "tma_info_bottleneck_memory_synchronization",
-        "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 10",
-        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
-        "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
-        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
-        "MetricName": "tma_info_bottleneck_mispredictions",
-        "MetricThreshold": "tma_info_bottleneck_mispredictions > 20",
-        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
-        "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bottleneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info_bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_latency + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_synchronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_irregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottleneck_useful_work)",
-        "MetricGroup": "BvOB;Cor;Offcore",
-        "MetricName": "tma_info_bottleneck_other_bottlenecks",
-        "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20",
-        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls."
-    },
-    {
-        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead.",
-        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
-        "MetricGroup": "BvUW;Ret",
-        "MetricName": "tma_info_bottleneck_useful_work",
-        "MetricThreshold": "tma_info_bottleneck_useful_work > 20"
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5"
     },
     {
         "BriefDescription": "Fraction of branches that are CALL or RET",
@@ -825,7 +852,7 @@
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
-        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))",
+        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks)",
         "MetricGroup": "SMT",
         "MetricName": "tma_info_core_core_clks"
     },
@@ -850,14 +877,14 @@
     },
     {
         "BriefDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width)",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@) / (2 * tma_info_core_core_clks)",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xFC@) / (2 * tma_info_core_core_clks)",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_core_fp_arith_utilization",
-        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)."
+        "PublicDescription": "Actual per-core usage of the Floating Point non-X87 execution units (regardless of precision or vector-width). Values > 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less common)"
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -870,20 +897,20 @@
         "PublicDescription": "Fraction of Uops delivered by the DSB (aka Decoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
-        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.",
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT",
         "MetricGroup": "DSBmiss",
         "MetricName": "tma_info_frontend_dsb_switch_cost"
     },
     {
         "BriefDescription": "Average number of Uops issued by front-end when it issued something",
-        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_frontend_fetch_upc"
     },
     {
         "BriefDescription": "Average Latency for L1 instruction cache misses",
-        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@ + 2",
+        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@ + 2",
         "MetricGroup": "Fed;FetchLat;IcMiss",
         "MetricName": "tma_info_frontend_icache_miss_latency"
     },
@@ -913,7 +940,13 @@
         "MetricName": "tma_info_frontend_l2mpki_code_all"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
     },
@@ -928,11 +961,11 @@
     {
         "BriefDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate)",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@)",
+        "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xFC@)",
        "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
         "MetricThreshold": "tma_info_inst_mix_iparith < 10",
-        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
@@ -940,7 +973,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
@@ -948,7 +981,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)",
@@ -956,7 +989,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx512",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
@@ -964,7 +997,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
@@ -972,7 +1005,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
@@ -1018,7 +1051,7 @@
     },
     {
         "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@",
+        "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY",
         "MetricGroup": "Prefetches",
         "MetricName": "tma_info_inst_mix_ipswpf",
         "MetricThreshold": "tma_info_inst_mix_ipswpf < 100"
     },
     {
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 9",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp"
     },
     {
@@ -1075,7 +1108,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
@@ -1093,7 +1126,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
@@ -1135,13 +1168,13 @@
     },
     {
         "BriefDescription": "Average per-thread data access bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration_time",
+        "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW;Offcore",
         "MetricName": "tma_info_memory_l3_cache_access_bw"
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l3_cache_fill_bw"
     },
@@ -1160,7 +1193,7 @@
     {
         "BriefDescription": "Average Latency for L2 cache miss demand Loads",
         "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
-        "MetricGroup": "Memory_Lat;Offcore",
+        "MetricGroup": "LockCont;Memory_Lat;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_miss_latency"
     },
     {
@@ -1216,7 +1249,7 @@
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per core",
-        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=1@)",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
         "MetricGroup": "Cor;Pipeline;PortsUtil;SMT",
         "MetricName": "tma_info_pipeline_execute"
     },
@@ -1237,18 +1270,18 @@
         "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.ANY)",
         "MetricGroup": "MicroSeq;Pipeline;Ret;Retire",
         "MetricName": "tma_info_pipeline_ipassist",
-        "MetricThreshold": "tma_info_pipeline_ipassist < 100e3",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
         "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)"
     },
     {
-        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
-        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@",
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=0x1@",
         "MetricGroup": "Pipeline;Ret",
         "MetricName": "tma_info_pipeline_retire"
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -1266,29 +1299,29 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time",
+        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / tma_info_system_time",
         "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
-        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full"
+        "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full"
     },
     {
         "BriefDescription": "Giga Floating Point Operations Per Second",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / duration_time",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
         "MetricGroup": "Cor;Flops;HPC",
         "MetricName": "tma_info_system_gflops",
         "PublicDescription": "Giga Floating Point Operations Per Second. Aggregate across all supported options of: FP precisions, scalar and vector instructions, vector-width"
     },
     {
         "BriefDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]",
-        "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time",
+        "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / tma_info_system_time",
         "MetricGroup": "IoBW;MemOffcore;Server;SoC",
         "MetricName": "tma_info_system_io_read_bw",
         "PublicDescription": "Average IO (network or disk) Bandwidth Use for Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU"
     },
     {
         "BriefDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]",
-        "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time",
+        "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / tma_info_system_time",
         "MetricGroup": "IoBW;MemOffcore;Server;SoC",
         "MetricName": "tma_info_system_io_write_bw",
         "PublicDescription": "Average IO (network or disk) Bandwidth Use for Writes [GB / sec]. Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU"
@@ -1298,13 +1331,14 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
         "MetricGroup": "OS",
-        "MetricName": "tma_info_system_kernel_cpi"
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
@@ -1322,43 +1356,37 @@
     },
     {
         "BriefDescription": "Average number of parallel data read requests to external memory",
-        "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD@thresh\\=1@",
+        "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD\\,thresh\\=0x1@",
         "MetricGroup": "Mem;MemoryBW;SoC",
         "MetricName": "tma_info_system_mem_parallel_reads",
         "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
     },
-    {
-        "BriefDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]",
-        "MetricExpr": "(1e9 * (UNC_M_PMM_RPQ_OCCUPANCY.ALL / UNC_M_PMM_RPQ_INSERTS) / imc_0@event\\=0x0@ if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryLat;Server;SoC",
-        "MetricName": "tma_info_system_mem_pmm_read_latency",
-        "PublicDescription": "Average latency of data read request to external 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L2 data-read prefetches"
-    },
     {
         "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)",
-        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_time)",
         "MetricGroup": "Mem;MemoryLat;SoC",
         "MetricName": "tma_info_system_mem_read_latency",
         "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_read_bw"
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_write_bw"
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-ram@) / (duration_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0",
         "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_clks)",
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license0_utilization",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1",
@@ -1366,7 +1394,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license1_utilization",
         "MetricThreshold": "tma_info_system_power_license1_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)",
@@ -1374,7 +1402,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license2_utilization",
         "MetricThreshold": "tma_info_system_power_license2_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
@@ -1388,6 +1416,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -1396,12 +1431,12 @@
     },
     {
         "BriefDescription": "Measured Average Uncore Frequency for the SoC [GHz]",
-        "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system_time",
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_uncore_frequency"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks"
@@ -1410,14 +1445,15 @@
         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
-        "MetricName": "tma_info_thread_cpi"
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "The ratio of Executed- by Issued-Uops",
         "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
         "MetricGroup": "Cor;Pipeline",
         "MetricName": "tma_info_thread_execute_per_issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage."
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
@@ -1443,43 +1479,52 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 6"
+        "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
         "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example.
Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_thread_clks)", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
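As an aside for reviewers: the rename from tma_l1_hit_latency to tma_l1_latency_dependency keeps the same shape of expression. A Python sketch with made-up counts, illustrative only:

def l1_latency_dependency(all_loads, fb_hit, l1_miss,
                          cycles_mem_any, cycles_l1d_miss, clks):
    # L1D hits estimated as all loads minus fill-buffer hits and L1
    # misses; each is charged 2 cycles, derated to 20%, per the
    # MetricExpr above.
    charged = 2 * (all_loads - fb_hit - l1_miss) * 20 / 100
    # Capped by cycles stalled on memory but not on an L1D miss.
    return min(charged, max(cycles_mem_any - cycles_l1d_miss, 0)) / clks

print(l1_latency_dependency(5e9, 2e8, 1e8, 3e9, 1e9, 1e10))  # ~0.188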
Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1487,17 +1532,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "17 * tma_info_system_core_frequency * (MEM_LOAD_RET= IRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2))= / tma_info_thread_clks", + "MetricExpr": "(20.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1505,18 +1550,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
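The constant changes here follow a differencing pattern: each level is charged its idle latency minus the level below it (3.5 for the new tma_l2_hit_latency, 20.5 - 3.5 for tma_l3_hit_latency, and further down 80 - 20.5 for local DRAM and 147.5 - 20.5 for remote DRAM), so the per-level hit-latency metrics do not double-count. A sketch, treating the constants as nanoseconds scaled by core frequency in GHz (my reading of the expressions, with hypothetical counts):

def hit_latency_fraction(lat_ns, lower_lat_ns, freq_ghz, hits,
                         fb_hit, l1_miss, clks):
    # Latency delta over the level below, converted ns -> core cycles.
    cycles_per_hit = (lat_ns - lower_lat_ns) * freq_ghz
    # The (1 + FB_HIT / L1_MISS / 2) factor from the MetricExpr above.
    weighted_hits = hits * (1 + fb_hit / l1_miss / 2)
    return cycles_per_hit * weighted_hits / clks

print(hit_latency_fraction(3.5, 0.0, 4.0, 2e8, 1e8, 2e8, 1e10))   # L2 share
print(hit_latency_fraction(20.5, 3.5, 4.0, 5e7, 1e8, 2e8, 1e10))  # L3 share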
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
     {
@@ -1534,7 +1579,7 @@
         "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_hit",
-        "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1542,24 +1587,48 @@
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_miss",
-        "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory",
-        "MetricExpr": "59.5 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(80 * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_local_mem",
-        "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -1571,16 +1640,16 @@
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
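The new tma_load_stlb_miss_{4k,2m,1g} entries above split the STLB-miss page-walk cost by completed walks per page size. An illustrative Python sketch with made-up counts:

def stlb_miss_split(stlb_miss, walks_4k, walks_2m4m, walks_1g):
    # Apportion the parent tma_load_stlb_miss cost by the share of
    # DTLB_LOAD_MISSES.WALK_COMPLETED_* events, per the MetricExprs above.
    total = walks_4k + walks_2m4m + walks_1g
    return {"4K": stlb_miss * walks_4k / total,
            "2M/4M": stlb_miss * walks_2m4m / total,
            "1G": stlb_miss * walks_1g / total}

print(stlb_miss_split(0.08, 9e6, 8e5, 2e5))  # most walks are 4K pages here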
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1588,8 +1657,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
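For the cmask change above: tma_mem_bandwidth counts cycles with at least four demand data reads in flight (now written as cmask=0x4), and tma_mem_latency takes the remaining cycles with any read in flight. A hypothetical-numbers sketch, not the perf implementation:

def mem_bw_lat(clks, cycles_rd_ge_4, cycles_rd_ge_1):
    # Cycles with >= 4 outstanding data reads are treated as
    # bandwidth-bound; cycles with any outstanding read, minus those,
    # are attributed to latency, per the two MetricExprs above.
    bw = min(clks, cycles_rd_ge_4) / clks
    lat = min(clks, cycles_rd_ge_1) / clks - bw
    return bw, lat

print(mem_bw_lat(1e10, 1.5e9, 4e9))  # (0.15, 0.25)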
Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1600,11 +1669,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1618,7 +1687,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). 
These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1626,8 +1695,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1640,12 +1709,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1653,8 +1722,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. 
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1663,7 +1732,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%" }, { @@ -1671,7 +1740,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1685,29 +1754,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM = * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOA= D_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LO= AD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD= _RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMO= TE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOA= D_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RET= IRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE= _PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCL= E_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L= 1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bo= und) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL= _PMM) > MEM_LOAD_RETIRED.L1_MISS else 0) if #has_pmem > 0 else 0)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was 
stalled on accesses to external 3D-Xpoint= (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module.", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1761,7 +1820,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_= 512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1770,7 +1829,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1787,8 +1846,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
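A note on the ports_utilized_N expressions in the entries that follow: with SMT on, the per-core UOPS_EXECUTED.CORE_CYCLES_GE_N histogram is differenced and halved; without SMT, EXE_ACTIVITY.N_PORTS_UTIL is used directly. A sketch with hypothetical counts; the halving rationale is my reading of the per-physical-core events:

def ports_utilized_share(ge_counts, n, smt_on, core_clks, exe_ports_util):
    # Cycles executing exactly n uops = (>= n) - (>= n+1); halved with
    # SMT because CORE_CYCLES_GE_* counts once per physical core.
    if smt_on:
        return (ge_counts[n] - ge_counts[n + 1]) / 2 / core_clks
    return exe_ports_util / core_clks

print(ports_utilized_share({1: 6e9, 2: 4e9}, 1, True, 1e10, None))  # 0.1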
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1796,8 +1855,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1805,7 +1864,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1814,35 +1873,35 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). 
Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "(89.5 * tma_info_system_core_frequency * MEM_LOAD_L= 3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_system_core_frequency * MEM_LO= AD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "((110 * tma_info_system_core_frequency - 20.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 = * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) *= MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_= LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "127 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(147.5 * tma_info_system_core_frequency - 20.5 * tm= a_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 += MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_= clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1860,7 +1919,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1869,8 +1928,8 @@ "MetricExpr": "40 * ROB_MISC_EVENTS.PAUSE_INST / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. 
Sample with: ROB_MISC_EVENTS.P= AUSE_INST", "ScaleUnit": "100%" }, { @@ -1879,8 +1938,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1888,17 +1947,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1906,8 +1965,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1915,18 +1974,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
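The store-forward and split-load metrics in this region use simple fixed-cost models; a sketch with hypothetical counts, illustrative only:

def store_fwd_blk(fwd_blocked_loads, clks):
    # tma_store_fwd_blk charges ~13 cycles per blocked load, per the
    # MetricExpr above.
    return 13 * fwd_blocked_loads / clks

def split_loads(load_miss_real_latency, no_sr_blocks, clks):
    # tma_split_loads: observed load-miss latency times LD_BLOCKS.NO_SR.
    return load_miss_real_latency * no_sr_blocks / clks

print(store_fwd_blk(2e8, 1e10))      # 0.26 -> above the 0.1 threshold
print(split_loads(18.0, 5e7, 1e10))  # 0.09 -> below the relaxed 0.3 threshold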
For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1942,7 +2001,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1950,7 +2009,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 
& tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1958,7 +2041,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1967,8 +2050,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json = b/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json index cccfcab3425e..a579603f720b 100644 --- a/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/metricgroups.json @@ -38,6 +38,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -84,7 +85,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -113,10 +116,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -129,5 +135,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json = b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json index 6347eba48810..c9596e18ec09 100644 --- 
a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-cache.json @@ -4577,7 +4577,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Inserts : CRds issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. Does not include a= ddressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4588,7 +4588,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Inserts : DRds issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. Does not include a= ddressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4599,7 +4599,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -4609,7 +4609,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -4619,7 +4619,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores = that hit the LLC : Counts the number of entries successfully inserted into = the TOR that match qualifications specified by the subevent. Does not inc= lude addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4630,7 +4630,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. Does not include a= ddressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4651,7 +4651,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Inserts : CRds issued by iA Cores that M= issed the LLC : Counts the number of entries successfully inserted into the= TOR that match qualifications specified by the subevent. Does not includ= e addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4662,7 +4662,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Inserts : DRds issued by iA Cores that M= issed the LLC : Counts the number of entries successfully inserted into the= TOR that match qualifications specified by the subevent. 
Does not includ= e addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4673,7 +4673,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -4683,7 +4683,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -4693,7 +4693,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores = that missed the LLC : Counts the number of entries successfully inserted in= to the TOR that match qualifications specified by the subevent. Does not = include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4704,7 +4704,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that M= issed the LLC : Counts the number of entries successfully inserted into the= TOR that match qualifications specified by the subevent. Does not includ= e addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -4747,7 +4747,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM", "Experimental": "1", - "Filter": "config1=3D0x4903300000000", + "Filter": "config1=3D0x49033", "PerPkg": "1", "PublicDescription": "Counts the number of entries successfully in= serted into the TOR that are generated from local IO ItoM requests that mis= s the LLC. An ItoM request is used by IIO to request a data write without f= irst reading the data for ownership.", "UMask": "0x24", @@ -4759,7 +4759,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RDCUR", "Experimental": "1", - "Filter": "config1=3D0x43c3300000000", + "Filter": "config1=3D0x43C33", "PerPkg": "1", "PublicDescription": "Counts the number of entries successfully in= serted into the TOR that are generated from local IO RdCur requests and mis= s the LLC. A RdCur request is used by IIO to read data without changing sta= te.", "UMask": "0x24", @@ -4771,7 +4771,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RFO", "Experimental": "1", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "Counts the number of entries successfully in= serted into the TOR that are generated from local IO RFO requests that miss= the LLC. A read for ownership (RFO) requests a cache line to be cached in = E state with the intent to modify.", "UMask": "0x24", @@ -4999,7 +4999,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. 
= Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -5010,7 +5010,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -5021,7 +5021,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -5031,7 +5031,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x11", "Unit": "CHA" @@ -5041,7 +5041,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Core= s that hit the LLC : For each cycle, this event accumulates the number of v= alid entries in the TOR that match qualifications specified by the subevent= . Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -5052,7 +5052,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that= Hit the LLC : For each cycle, this event accumulates the number of valid e= ntries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x11", @@ -5073,7 +5073,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -5084,7 +5084,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. 
= Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -5095,7 +5095,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefCRD", - "Filter": "config1=3D0x4b23300000000", + "Filter": "config1=3D0x4b233", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -5105,7 +5105,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefDRD", - "Filter": "config1=3D0x4b43300000000", + "Filter": "config1=3D0x4b433", "PerPkg": "1", "UMask": "0x21", "Unit": "CHA" @@ -5115,7 +5115,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO", - "Filter": "config1=3D0x4b03300000000", + "Filter": "config1=3D0x4b033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Core= s that missed the LLC : For each cycle, this event accumulates the number o= f valid entries in the TOR that match qualifications specified by the subev= ent. Does not include addressless requests such as locks and interrupts= .", "UMask": "0x21", @@ -5126,7 +5126,7 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that= Missed the LLC : For each cycle, this event accumulates the number of vali= d entries in the TOR that match qualifications specified by the subevent. = Does not include addressless requests such as locks and interrupts.", "UMask": "0x21", @@ -5171,7 +5171,7 @@ "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_ITOM", "Experimental": "1", - "Filter": "config1=3D0x4903300000000", + "Filter": "config1=3D0x49033", "PerPkg": "1", "PublicDescription": "For each cycle, this event accumulates the n= umber of valid entries in the TOR that are generated from local IO ItoM req= uests that miss the LLC. An ItoM is used by IIO to request a data write wit= hout first reading the data for ownership.", "UMask": "0x24", @@ -5183,7 +5183,7 @@ "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RDCUR", "Experimental": "1", - "Filter": "config1=3D0x43c3300000000", + "Filter": "config1=3D0x43C33", "PerPkg": "1", "PublicDescription": "For each cycle, this event accumulates the n= umber of valid entries in the TOR that are generated from local IO RdCur re= quests that miss the LLC. A RdCur request is used by IIO to read data witho= ut changing state.", "UMask": "0x24", @@ -5195,7 +5195,7 @@ "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RFO", "Experimental": "1", - "Filter": "config1=3D0x4003300000000", + "Filter": "config1=3D0x40033", "PerPkg": "1", "PublicDescription": "For each cycle, this event accumulates the n= umber of valid entries in the TOR that are generated from local IO RFO requ= ests that miss the LLC. 
A read for ownership (RFO) requests data to be cached in E state with the intent to modify.",
         "UMask": "0x24",
diff --git a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json
index 91889e447bd1..4efbfeb338a6 100644
--- a/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/cascadelakex/uncore-interconnect.json
@@ -13855,16 +13855,5 @@
         "PerPkg": "1",
         "PublicDescription": "Number outstanding register requests within message channel tracker",
         "Unit": "UBOX"
-    },
-    {
-        "BriefDescription": "UPI interconnect send bandwidth for payload. Derived from unc_upi_txl_flits.all_data",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x2",
-        "EventName": "UPI_DATA_BANDWIDTH_TX",
-        "PerPkg": "1",
-        "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.",
-        "ScaleUnit": "7.11E-06Bytes",
-        "UMask": "0xf",
-        "Unit": "UPI"
     }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 20a1b8ad38be..600b3e029d24 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -6,7 +6,7 @@ GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core
 GenuineIntel-6-(3D|47),v30,broadwell,core
 GenuineIntel-6-56,v12,broadwellde,core
 GenuineIntel-6-4F,v23,broadwellx,core
-GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core
+GenuineIntel-6-55-[56789ABCDEF],v1.23,cascadelakex,core
 GenuineIntel-6-9[6C],v1.05,elkhartlake,core
 GenuineIntel-6-CF,v1.09,emeraldrapids,core
 GenuineIntel-6-5[CF],v13,goldmont,core
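A usage sketch for the corrected TOR filters (hypothetical invocation: the
CHA instance, event/umask pair and workload are placeholders). The "Filter"
strings above name the config1 term directly, so an equivalent manual
invocation would look like:

  $ perf stat -e 'uncore_cha_0/event=0x35,umask=0x21,config1=0x40433/' -a sleep 1

which would count DRds issued by iA cores that miss the LLC on CHA 0 with
the shortened filter value this change introduces.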
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:40 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-9-irogers@google.com>
Subject: [PATCH v2 08/23] perf vendor events: Add Clearwaterforest events
From: Ian Rogers
To: linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org

Add events v1.00.
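As a usage sketch (hypothetical invocations; the workload is a
placeholder), the new events resolve by name once the JSON is installed:

  $ perf stat -e LONGEST_LAT_CACHE.REFERENCE,LONGEST_LAT_CACHE.MISS -a sleep 1

and the MEM_UOPS_RETIRED.LOAD_LATENCY_GT_* events added in cache.json only
count with PEBS enabled, so they are meant for precise sampling, e.g.:

  $ perf record -e MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128:p -- <workload>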
Bring in the events from: https://github.com/intel/perfmon/tree/main/CWF/events Co-developed-by: Caleb Biggers Signed-off-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/clearwaterforest/cache.json | 144 ++++++++++++++++++ .../arch/x86/clearwaterforest/counter.json | 7 + .../arch/x86/clearwaterforest/frontend.json | 18 +++ .../arch/x86/clearwaterforest/memory.json | 22 +++ .../arch/x86/clearwaterforest/other.json | 22 +++ .../arch/x86/clearwaterforest/pipeline.json | 113 ++++++++++++++ .../x86/clearwaterforest/virtual-memory.json | 29 ++++ tools/perf/pmu-events/arch/x86/mapfile.csv | 1 + 8 files changed, 356 insertions(+) create mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/cache.j= son create mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/counter= .json create mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/fronten= d.json create mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/memory.= json create mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/other.j= son create mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/pipelin= e.json create mode 100644 tools/perf/pmu-events/arch/x86/clearwaterforest/virtual= -memory.json diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/cache.json b/t= ools/perf/pmu-events/arch/x86/clearwaterforest/cache.json new file mode 100644 index 000000000000..875361b30f1d --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/cache.json @@ -0,0 +1,144 @@ +[ + { + "BriefDescription": "Counts the number of cacheable memory request= s that miss in the LLC. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.MISS", + "PublicDescription": "Counts the number of cacheable memory reques= ts that miss in the Last Level Cache (LLC). Requests include demand loads, = reads for ownership (RFO), instruction fetches and L1 HW prefetches. If the= core has access to an L3 cache, the LLC is the L3 cache, otherwise it is t= he L2 cache. Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x41" + }, + { + "BriefDescription": "Counts the number of cacheable memory request= s that access the LLC. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts the number of cacheable memory reques= ts that access the Last Level Cache (LLC). Requests include demand loads, r= eads for ownership (RFO), instruction fetches and L1 HW prefetches. If the = core has access to an L3 cache, the LLC is the L3 cache, otherwise it is th= e L2 cache. 
Counts on a per core basis.", + "SampleAfterValue": "1000003", + "UMask": "0x4f" + }, + { + "BriefDescription": "Counts the number of load ops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", + "SampleAfterValue": "1000003", + "UMask": "0x81" + }, + { + "BriefDescription": "Counts the number of store ops retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.ALL_STORES", + "SampleAfterValue": "1000003", + "UMask": "0x82" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_1024", + "MSRIndex": "0x3F6", + "MSRValue": "0x400", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", + "MSRIndex": "0x3F6", + "MSRValue": "0x80", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", + "MSRIndex": "0x3F6", + "MSRValue": "0x10", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_2048", + "MSRIndex": "0x3F6", + "MSRValue": "0x800", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", + "MSRIndex": "0x3F6", + "MSRValue": "0x100", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", + "MSRIndex": "0x3F6", + "MSRValue": "0x20", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", + "MSRIndex": "0x3F6", + "MSRValue": "0x4", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", + "MSRIndex": "0x3F6", + "MSRValue": "0x200", + 
"SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", + "MSRIndex": "0x3F6", + "MSRValue": "0x40", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled.", + "Counter": "0,1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", + "MSRIndex": "0x3F6", + "MSRValue": "0x8", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of stores uops retired sam= e as MEM_UOPS_RETIRED.ALL_STORES", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x6" + } +] diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/counter.json b= /tools/perf/pmu-events/arch/x86/clearwaterforest/counter.json new file mode 100644 index 000000000000..a0eaf5b6f2bc --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/counter.json @@ -0,0 +1,7 @@ +[ + { + "Unit": "core", + "CountersNumFixed": "3", + "CountersNumGeneric": "39" + } +] \ No newline at end of file diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/frontend.json = b/tools/perf/pmu-events/arch/x86/clearwaterforest/frontend.json new file mode 100644 index 000000000000..7a9250e5c8f2 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/frontend.json @@ -0,0 +1,18 @@ +[ + { + "BriefDescription": "Counts every time the code stream enters into= a new cache line by walking sequential from the previous line or being red= irected by a jump.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.ACCESSES", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, + { + "BriefDescription": "Counts every time the code stream enters into= a new cache line by walking sequential from the previous line or being red= irected by a jump and the instruction cache registers bytes are not present= . 
-", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.MISSES", + "SampleAfterValue": "1000003", + "UMask": "0x2" + } +] diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/memory.json b/= tools/perf/pmu-events/arch/x86/clearwaterforest/memory.json new file mode 100644 index 000000000000..f5007e56f39b --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/memory.json @@ -0,0 +1,22 @@ +[ + { + "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x33FBFC00001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were no= t supplied by the L3 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x33FBFC00002", + "SampleAfterValue": "100003", + "UMask": "0x1" + } +] diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/other.json b/t= ools/perf/pmu-events/arch/x86/clearwaterforest/other.json new file mode 100644 index 000000000000..80454e497f83 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/other.json @@ -0,0 +1,22 @@ +[ + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10002", + "SampleAfterValue": "100003", + "UMask": "0x1" + } +] diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/pipeline.json = b/tools/perf/pmu-events/arch/x86/clearwaterforest/pipeline.json new file mode 100644 index 000000000000..6a5faa704b85 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/pipeline.json @@ -0,0 +1,113 @@ +[ + { + "BriefDescription": "Counts the total number of branch instruction= s retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of instructions in w= hich the instruction pointer (IP) of the processor is resteered due to a br= anch instruction and the branch instruction successfully retires. All bran= ch type instructions are accounted for.", + "SampleAfterValue": "1000003" + }, + { + "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. 
The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", + "SampleAfterValue": "1000003" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.CORE", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es. [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.CORE_P", + "SampleAfterValue": "1000003" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = reference clock cycles.", + "Counter": "Fixed counter 2", + "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "SampleAfterValue": "1000003", + "UMask": "0x3" + }, + { + "BriefDescription": "Counts the number of unhalted reference clock= cycles at TSC frequency.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", + "PublicDescription": "Counts the number of reference cycles that t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction. This event is not affected by core frequency ch= anges and increments at a fixed frequency that is also used for the Time St= amp Counter (TSC). This event uses a programmable general purpose performan= ce counter.", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.THREAD", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es. [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "SampleAfterValue": "1000003" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired.", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts the number of instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "SampleAfterValue": "1000003" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of issue slo= ts that were not consumed by the backend because allocation is stalled due = to a mispredicted jump or a machine clear.", + "Counter": "36", + "EventName": "TOPDOWN_BAD_SPECULATION.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x5" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls. [This event is alias to TOPDOWN_BE_BOUND.ALL_P= ]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls. 
[This event is alias to TOPDOWN_BE_BOUND.ALL]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of retiremen= t slots not consumed due to front end stalls.", + "Counter": "37", + "EventName": "TOPDOWN_FE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x6" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of consumed = retirement slots.", + "Counter": "38", + "EventName": "TOPDOWN_RETIRING.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7" + } +] diff --git a/tools/perf/pmu-events/arch/x86/clearwaterforest/virtual-memory= .json b/tools/perf/pmu-events/arch/x86/clearwaterforest/virtual-memory.json new file mode 100644 index 000000000000..78f2b835c1fa --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/clearwaterforest/virtual-memory.json @@ -0,0 +1,29 @@ +[ + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to any page size.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to any page si= ze. Includes page walks that page fault.", + "SampleAfterValue": "1000003", + "UMask": "0xe" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to any page size.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to any page size. Includes page walks = that page fault.", + "SampleAfterValue": "1000003", + "UMask": "0xe" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to any page size.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to any page size. 
Includes page walks that page fault.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xe"
+    }
+]
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 600b3e029d24..88919998da3d 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -7,6 +7,7 @@ GenuineIntel-6-(3D|47),v30,broadwell,core
 GenuineIntel-6-56,v12,broadwellde,core
 GenuineIntel-6-4F,v23,broadwellx,core
 GenuineIntel-6-55-[56789ABCDEF],v1.23,cascadelakex,core
+GenuineIntel-6-DD,v1.00,clearwaterforest,core
 GenuineIntel-6-9[6C],v1.05,elkhartlake,core
 GenuineIntel-6-CF,v1.09,emeraldrapids,core
 GenuineIntel-6-5[CF],v13,goldmont,core
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:41 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-10-irogers@google.com>
Subject: [PATCH v2 09/23] perf vendor events: Update EmeraldRapids events/metrics
From: Ian Rogers
To: linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org

Update events from v1.09 to v1.11. Update TMA metrics from 4.8 to 5.01.
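As a usage sketch (hypothetical invocation; any level or metric of the
tree can be requested by name), the refreshed TMA metrics are consumed
through the metric groups generated from this JSON, e.g.:

  $ perf stat -M TopdownL1 -a sleep 1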
Bring in the event updates v1.11:
https://github.com/intel/perfmon/commit/bffcec00a184bb93d505f182047cf889d124fbd5
https://github.com/intel/perfmon/commit/a63da6de48046c365ab91c5001bfd5d907d5a1d6
The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782
Update uncore IIO events umask with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:
Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/emeraldrapids/cache.json         |   28 +-
 .../arch/x86/emeraldrapids/emr-metrics.json   | 1036 +++++++++--------
 .../arch/x86/emeraldrapids/frontend.json      |   19 -
 .../arch/x86/emeraldrapids/memory.json        |   15 +-
 .../arch/x86/emeraldrapids/metricgroups.json  |   10 +-
 .../arch/x86/emeraldrapids/pipeline.json      |   23 -
 .../arch/x86/emeraldrapids/uncore-io.json     |  218 ++--
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 8 files changed, 626 insertions(+), 725 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json
index 21d5d96b8a6d..3b0581151d63 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/cache.json
@@ -92,11 +92,11 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0x26",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
@@ -311,7 +311,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions.
This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" }, @@ -322,7 +321,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" }, @@ -333,7 +331,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loads and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" }, @@ -344,7 +341,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked access.", "SampleAfterValue": "100007", "UMask": "0x21" }, @@ -355,7 +351,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" }, @@ -366,7 +361,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" }, @@ -377,7 +371,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" }, @@ -388,7 +381,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" }, @@ -408,7 +400,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" }, @@ -419,7 +410,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" }, @@ -430,7 +420,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" }, @@ -441,7 +430,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" }, @@ -452,7 +440,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -463,7 +450,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", - "PEBS": "1", "SampleAfterValue": "1000003",
"UMask": "0x2" }, @@ -473,7 +459,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources was forwarded from a remote cache.", "SampleAfterValue": "100007", "UMask": "0x8" }, @@ -484,7 +469,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -494,7 +478,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" }, @@ -505,7 +488,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" }, @@ -516,7 +498,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" }, @@ -527,7 +508,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" }, @@ -538,7 +518,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" }, @@ -549,7 +528,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" }, @@ -560,7 +538,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" }, @@ -571,7 +548,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json index ee288099a8d3..73bc185b39da 100644 --- a/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/emr-metrics.json @@ -34,7 +34,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@
-55,85 +55,73 @@ "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the Data Translation Lookaside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data loads to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions. This implies it missed in the DTLB and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by demand data stores to the total number of completed instructions.
This implies it missed in the DTLB and further levels of TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", - "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e6 / duration_time", - "MetricName": "iio_bandwidth_read", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Bandwidth observed by the integrated I/O traffic controller (IIO) of IO writes that are initiated by end device controllers that are writing memory to the CPU.", - "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1e6 / duration_time", - "MetricName": "iio_bandwidth_write", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the local CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from a remote CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from a remote CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the local CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the local CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to a remote CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to a remote CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Percentage of inbound full
cacheline writes initiated by end device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound full cacheline writes initiated by end device controllers that miss the L3 cache", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM / UNC_CHA_TOR_INSERTS.IO_ITOM", "MetricName": "io_percent_of_inbound_full_writes_that_miss_l3", "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of inbound partial cacheline writes initiated by end device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound partial cacheline writes initiated by end device controllers that miss the L3 cache", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_MISS_ITOMCACHENEAR + UNC_CHA_TOR_INSERTS.IO_MISS_RFO) / (UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR + UNC_CHA_TOR_INSERTS.IO_RFO)", "MetricName": "io_percent_of_inbound_partial_writes_that_miss_l3", "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of inbound reads initiated by end device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound reads initiated by end device controllers that miss the L3 cache", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_PCIRDCUR / UNC_CHA_TOR_INSERTS.IO_PCIRDCUR", "MetricName": "io_percent_of_inbound_reads_that_miss_l3", "ScaleUnit": "100%" @@ -142,14 +130,14 @@ "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions.
This implies it missed in the ITLB (Instruction TLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -230,36 +218,6 @@ "MetricName": "llc_demand_data_read_miss_to_dram_latency", "ScaleUnit": "1ns" }, - { - "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to Intel(R) Optane(TM) Persistent Memory(PMEM) in nano seconds", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM) * #num_packages)) * duration_time", - "MetricName": "llc_demand_data_read_miss_to_pmem_latency", - "ScaleUnit": "1ns" - }, - { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_read", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_local_memory_bandwidth_write", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_read", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory.", - "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time", - "MetricName": "llc_miss_remote_memory_bandwidth_write", - "ScaleUnit": "1MB/s" - }, { "BriefDescription": "The ratio of number of completed memory load instructions to the total number completed instructions", "MetricExpr": "MEM_INST_RETIRED.ALL_LOADS / INST_RETIRED.ANY", @@ -285,19 +243,19 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory write bandwidth (MB/sec) caused by directory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM).", + "BriefDescription": "Memory write bandwidth (MB/sec) caused by directory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM)", "MetricExpr": "(UNC_CHA_DIR_UPDATE.HA + UNC_CHA_DIR_UPDATE.TOR + UNC_M2M_DIRECTORY_UPDATE.ANY) * 64 / 1e6 / duration_time", "MetricName": "memory_extra_write_bw_due_to_directory_updates", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.", +
"BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -320,24 +278,6 @@ "MetricName": "percent_uops_delivered_from_microcode_sequencer", "ScaleUnit": "100%" }, - { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory read bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_RPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_read", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory bandwidth (MB/sec)", - "MetricExpr": "(UNC_M_PMM_RPQ_INSERTS + UNC_M_PMM_WPQ_INSERTS) * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_total", - "ScaleUnit": "1MB/s" - }, - { - "BriefDescription": "Intel(R) Optane(TM) Persistent Memory(PMEM) memory write bandwidth (MB/sec)", - "MetricExpr": "UNC_M_PMM_WPQ_INSERTS * 64 / 1e6 / duration_time", - "MetricName": "pmem_memory_bandwidth_write", - "ScaleUnit": "1MB/s" - }, { "BriefDescription": "Percentage of cycles spent in System Management Interrupts.", "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0 else 0)", @@ -360,7 +300,7 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", "MetricName": "tma_alu_op_utilization", @@ -372,7 +312,7 @@ "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_amx_busy", - "MetricThreshold": "tma_amx_busy > 0.5 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -380,12 +320,12 @@ "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline.
Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists", "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -394,35 +334,125 @@ }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots", - "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL1;Default", + "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "DefaultMetricgroupName": "TopdownL1", "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", - "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation.
For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example", "ScaleUnit": "100%" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks.
Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation.
Covers Core Bound when High ILP as well as when long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\,cmask\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e.g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\,cmask\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments).
Related metrics: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks.
Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\-br\-mispredict / (topdown\-fe\-bound + topdown\-bad\-spec + topdown\-retiring + topdown\-be\-bound) + 0 * tma_info_thread_slots", - "MetricGroup": "BadSpec;BrMispredicts;BvMP;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricExpr": "topdown\-br\-mispredict / (topdown\-fe\-bound + topdown\-bad\-spec + topdown\-retiring + topdown\-be\-bound) + 0 * slots", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS.
Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -430,24 +460,24 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings).", + "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -455,8 +485,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations >
0.1)", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRED.MS_FLOWS", "ScaleUnit": "100%" }, { @@ -464,45 +494,92 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the
Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses", - "MetricExpr": "(76.6 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 74.6 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((81 * tma_info_system_core_frequency - 4.4 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (79 * tma_info_system_core_frequency - 4.4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core.
Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group", "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g.
FP-chained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses", - "MetricExpr": "74.6 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(79 * tma_info_system_core_frequency - 4.4 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\,cmask\=1@ - cpu@INST_DECODED.DECODERS\,cmask\=2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\,cmask\=0x1@ - cpu@INST_DECODED.DECODERS\,cmask\=0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder.
Related metrics: tma_few_uops_instructions", "ScaleUnit": "100%" }, @@ -511,17 +588,17 @@ "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads", - "MetricExpr": "( MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks )", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS", "ScaleUnit": "100%" }, { @@ -530,7 +607,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -538,75 +615,73 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS.
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\,cmask\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\,cmask\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS.
Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\,cmask\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\,cmask\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing", - "MetricExpr": "81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEMAND_RFO.L3_MISS\,offcore_rsp\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM.
Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", - "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_gr= oup;tma_frontend_bound_group;tma_issueFB", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. 
Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", - "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group", + "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. 
This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -615,7 +690,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -624,7 +699,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -632,8 +715,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -641,8 +724,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIR= ED2.VECTOR) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6= , tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -650,8 +733,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", "ScaleUnit": "100%" }, { @@ -659,8 +742,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", "ScaleUnit": "100%" }, { @@ -668,39 +751,37 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", - "DefaultMetricgroupName": "TopdownL1", "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", - "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group= ", + "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", - "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", - "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", "ScaleUnit": "100%" }, { @@ -708,40 +789,40 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -755,7 +836,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT 
/ (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -769,15 +850,15 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -785,104 +866,10 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * ( ( tma_memory_bound * ( tma_dram_bound / ( t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d ) ) * ( tma_mem_bandwidth / ( tma_mem_bandwidth + tma_mem_latency ) ) ) += ( tma_memory_bound * ( tma_l3_bound / ( tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_sq_full / ( tma_con= tested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full ) ) )= + ( tma_memory_bound * ( tma_l1_bound / ( tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_fb_full / ( tma_d= tlb_load + tma_store_fwd_blk + tma_l1_hit_latency + tma_lock_latency + tma_= split_loads + tma_fb_full ) ) ) )", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * ( ( tma_memory_bound * ( tma_dram_bound / ( t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d ) ) * ( tma_mem_latency / ( tma_mem_bandwidth + tma_mem_latency ) ) ) + (= tma_memory_bound * ( tma_l3_bound / ( tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_dram_bound + tma_store_bound ) ) * ( tma_l3_hit_latency / ( tm= a_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full = ) ) ) + ( tma_memory_bound * tma_l2_bound / ( tma_l1_bound + tma_l2_bound += tma_l3_bound + tma_dram_bound + tma_store_bound ) ) + ( tma_memory_bound *= ( tma_store_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dra= m_bound + tma_store_bound ) ) * ( tma_store_latency / ( tma_store_latency += tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_sto= re ) ) ) + ( tma_memory_bound * ( tma_l1_bound / ( tma_l1_bound + tma_l2_bo= und + tma_l3_bound + tma_dram_bound + tma_store_bound ) ) * ( tma_l1_hit_la= tency / ( tma_dtlb_load + tma_store_fwd_blk + tma_l1_hit_latency + tma_lock= _latency + tma_split_loads + tma_fb_full ) ) ) )", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_amx_busy= + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_c= ore_bound * tma_amx_busy / (tma_amx_busy + tma_divider + tma_ports_utilizat= ion + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization = / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_ope= ration)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utili= zed_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetc= h_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers += tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts)= / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches))= / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_m= isses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_big_code", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_switches + tma_bra= nch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_= mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredi= cts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_swit= ches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) + = 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredic= ts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_ot= her_nukes + tma_core_bound * (tma_serializing_operation + cpu@RS.EMPTY\\,um= ask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_0) / (tma_amx_busy += tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_mic= rocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * = (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * ( tma_memory_bound * ( tma_l1_bound / max( tm= a_memory_bound , ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bo= und + tma_store_bound ) ) ) * ( tma_dtlb_load / max( tma_l1_bound , ( tma_d= tlb_load + tma_store_fwd_blk + tma_l1_hit_latency + tma_lock_latency + tma_= split_loads + tma_fb_full ) ) ) + ( tma_memory_bound * ( tma_store_bound / = ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_b= ound ) ) * ( tma_dtlb_store / ( tma_store_latency + tma_false_sharing + tma= _split_stores + tma_streaming_stores + tma_dtlb_store ) ) ) )", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * ( tma_memory_bound * ( ( tma_dram_bound / ( t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d ) ) * ( tma_mem_latency / ( tma_mem_bandwidth + tma_mem_latency ) ) * tma= _remote_cache / ( tma_local_mem + tma_remote_mem + tma_remote_cache ) + ( t= ma_l3_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound= + tma_store_bound ) ) * ( tma_contested_accesses + tma_data_sharing ) / ( = tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_ful= l ) + ( tma_store_bound / ( tma_l1_bound + tma_l2_bound + tma_l3_bound + tm= a_dram_bound + tma_store_bound ) ) * tma_false_sharing / ( ( tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store ) - tma_store_latency ) ) + tma_machine_clears * ( 1 - tma_other_nu= kes / ( tma_other_nukes ) ) )", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -943,11 +930,11 @@ "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -960,20 +947,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D1\\,edge@", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -1002,15 +989,21 @@ "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code_all" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node." + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -1028,7 +1021,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -1036,7 +1029,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -1044,7 +1037,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -1052,7 +1045,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -1060,7 +1053,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Pr= ecision instruction (lower number means higher occurrence rate)", @@ -1068,7 +1061,7 @@ "MetricGroup": "Flops;FpScalar;InsType;Server", "MetricName": "tma_info_inst_mix_iparith_scalar_hp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). 
Values < = 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1076,7 +1069,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1121,7 +1114,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1131,7 +1124,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 13", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1178,7 +1171,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1196,7 +1189,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1238,13 +1231,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1263,12 +1256,12 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": 
"tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1321,26 +1314,33 @@ "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" + }, { "BriefDescription": "Average DRAM BW for Reads-to-Core (R2C) cover= ing for memory attached to local- and remote-socket", - "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / duration_time", + "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system= _time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_dram_bw", - "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW." + "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW" }, { "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R= 2C)", - "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / duration_tim= e", + "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_sys= tem_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_l3m_bw", - "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW." + "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW" }, { "BriefDescription": "Average Off-core access BW for Reads-to-Core = (R2C)", - "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / duratio= n_time", + "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_inf= o_system_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_offcore_bw", - "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches." + "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). 
R2C accounts for demand or prefetch load/RFO/code accesses that fill data into the Core caches" }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1369,7 +1369,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1390,18 +1390,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" }, @@ -1415,7 +1415,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1433,28 +1433,28 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec].
Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_= INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 = * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS)= + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / d= uration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_= INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 = * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS)= + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / t= ma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / durati= on_time", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_in= fo_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e9 / duration_time", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_write_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. 
Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" @@ -1464,13 +1464,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1481,44 +1482,45 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=3D0x0@", "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" }, + { + "BriefDescription": "Fraction of Uncore cycles where requests got = rejected due to duplicate address already in IRQ ingress queue in the cache= homing agent", + "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", + "MetricGroup": "LockCont;MemOffcore;Server;SoC", + "MetricName": "tma_info_system_mem_irq_duplicate_address", + "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, - { - "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]", - "MetricExpr": "(1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC= _CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=3D0x1@ if #has_pme= m > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", - "MetricName": "tma_info_system_mem_pmm_read_latency", - "PublicDescription": "Average latency of data read request to exte= rnal 3D X-Point memory [in nanoseconds]. 
Accounts for demand loads and L1/L= 2 data-read prefetches" - }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_t= ime)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_read_bw" + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_write_bw" + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1528,10 +1530,17 @@ }, { "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "uncore_cha_0@event\\=3D0x1@", + "MetricExpr": "cha_0@event\\=3D0x0@", "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1540,7 +1549,7 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, @@ -1551,7 +1560,7 @@ "MetricName": "tma_info_system_upi_data_transmit_bw" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1560,14 +1569,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": 
"1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1577,13 +1587,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1599,7 +1609,15 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 9" + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", @@ -1607,7 +1625,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", "ScaleUnit": "100%" }, { @@ -1615,8 +1633,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1624,8 +1642,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1633,26 +1651,26 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. 
The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads that miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1660,8 +1678,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance.
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "4.4 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1669,17 +1696,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "32.6 * tma_info_system_core_frequency * (MEM_LOAD_R= ETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= )) / tma_info_thread_clks", + "MetricExpr": "(37 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1687,19 +1714,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", - "DefaultMetricgroupName": "TopdownL2", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations, instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", - "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .).
Sample with: INST_RETIRED.PREC_DIS= T", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations, instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1716,7 +1742,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1724,53 +1750,76 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M +
DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "72 * tma_info_system_core_frequency * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(109 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_= LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", - "MetricGroup": "BadSpec;BvMS;Default;MachineClears;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", "MetricName": "tma_mba_stalls", - "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", + "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1778,32 +1827,31 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", - "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations, uops for memory load or store ac= cesses", "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1816,7 +1864,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations
> 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", "ScaleUnit": "100%" }, { @@ -1824,8 +1872,8 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1838,21 +1886,29 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). 
Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clks / 2",
+        "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)",
-        "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=1\\,edge@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks",
+        "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=0x1\\,edge\\=0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
         "ScaleUnit": "100%"
     },
     {
@@ -1861,7 +1917,7 @@
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_light_operations_group",
         "MetricName": "tma_non_fused_branches",
         "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused.",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions that were not fused. Non-conditional branches like direct JMP or CALL would count here. Can be used to examine fusible conditional jumps that were not fused",
         "ScaleUnit": "100%"
     },
     {
@@ -1869,7 +1925,7 @@
         "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
         "MetricName": "tma_nop_instructions",
-        "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
         "ScaleUnit": "100%"
     },
@@ -1883,19 +1939,19 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types).",
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
         "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
         "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
         "MetricName": "tma_other_mispredicts",
-        "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering.",
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
         "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
         "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
         "MetricName": "tma_other_nukes",
-        "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears > 0.1 & tma_bad_speculation > 0.15)",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
@@ -1904,7 +1960,7 @@
         "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_page_faults",
         "MetricThreshold": "tma_page_faults > 0.05",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost.",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
         "ScaleUnit": "100%"
     },
     {
@@ -1913,7 +1969,7 @@
         "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_0",
         "MetricThreshold": "tma_port_0 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1922,7 +1978,7 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_1",
         "MetricThreshold": "tma_port_1 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1931,25 +1987,25 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_6",
         "MetricThreshold": "tma_port_6 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
-        "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=0xc@) / tma_info_thread_clks)",
+        "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
         "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_ports_utilization",
-        "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
-        "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(cpu@RS.EMPTY\\,umask\\=1@ - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks",
+        "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESOURCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_0",
-        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
         "ScaleUnit": "100%"
     },
     {
@@ -1957,7 +2013,7 @@
         "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_1",
-        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
         "ScaleUnit": "100%"
     },
@@ -1967,8 +2023,8 @@
         "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_2",
-        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6",
         "ScaleUnit": "100%"
     },
     {
@@ -1977,36 +2033,35 @@
         "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
         "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_3m",
-        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues",
-        "MetricExpr": "(133 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 133 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "((170 * tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (170 * tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group",
         "MetricName": "tma_remote_cache",
-        "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory",
-        "MetricExpr": "153 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(190 * tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_remote_mem",
-        "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS",
+        "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
-        "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group",
+        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL1;Default",
+        "MetricgroupNoGroup": "TopdownL1",
         "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
         "ScaleUnit": "100%"
     },
@@ -2015,7 +2070,7 @@
         "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks + tma_c02_wait",
         "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO",
         "MetricName": "tma_serializing_operation",
-        "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -2024,8 +2079,8 @@
         "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
         "MetricName": "tma_shuffles_256b",
-        "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer). Shuffles may incur slow cross \"vector lane\" data transfers.",
+        "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer). Shuffles may incur slow cross \"vector lane\" data transfers",
         "ScaleUnit": "100%"
     },
     {
@@ -2034,7 +2089,7 @@
         "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_slow_pause",
-        "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST",
         "ScaleUnit": "100%"
     },
@@ -2043,8 +2098,8 @@
         "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_split_loads",
-        "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
         "ScaleUnit": "100%"
     },
     {
@@ -2052,17 +2107,17 @@
         "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
         "MetricName": "tma_split_stores",
-        "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
         "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
         "MetricName": "tma_sq_full",
-        "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
         "ScaleUnit": "100%"
     },
     {
@@ -2070,8 +2125,8 @@
         "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_store_bound",
-        "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
         "ScaleUnit": "100%"
     },
     {
@@ -2079,17 +2134,17 @@
         "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_store_fwd_blk",
-        "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
         "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
-        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
         "MetricName": "tma_store_latency",
-        "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -2106,7 +2161,7 @@
         "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_hit",
-        "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -2114,7 +2169,31 @@
         "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_miss",
-        "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -2122,7 +2201,7 @@
         "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
         "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
         "MetricName": "tma_streaming_stores",
-        "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
         "ScaleUnit": "100%"
     },
@@ -2131,7 +2210,7 @@
         "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
         "MetricName": "tma_unknown_branches",
-        "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "ScaleUnit": "100%"
     },
@@ -2140,29 +2219,8 @@
         "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
         "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
         "MetricName": "tma_x87_use",
-        "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.",
-        "ScaleUnit": "100%"
-    },
-    {
-        "BriefDescription": "Percentage of cycles in aborted transactions.",
-        "MetricExpr": "(max(cycles\\-t - cycles\\-ct, 0) / cycles if has_event(cycles\\-t) else 0)",
-        "MetricGroup": "transaction",
-        "MetricName": "tsx_aborted_cycles",
-        "ScaleUnit": "100%"
-    },
-    {
-        "BriefDescription": "Number of cycles within a transaction divided by the number of transactions.",
-        "MetricExpr": "(cycles\\-t / tx\\-start if has_event(cycles\\-t) else 0)",
-        "MetricGroup": "transaction",
-        "MetricName": "tsx_cycles_per_transaction",
-        "ScaleUnit": "1cycles / transaction"
-    },
-    {
-        "BriefDescription": "Percentage of cycles within a transaction region.",
-        "MetricExpr": "(cycles\\-t / cycles if has_event(cycles\\-t) else 0)",
-        "MetricGroup": "transaction",
-        "MetricName": "tsx_transactional_cycles",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%"
     },
     {
@@ -2171,12 +2229,6 @@
         "MetricName": "uncore_frequency",
         "ScaleUnit": "1GHz"
     },
-    {
-        "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data receive bandwidth (MB/sec)",
-        "MetricExpr": "UNC_UPI_RxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time",
-        "MetricName": "upi_data_receive_bw",
-        "ScaleUnit": "1MB/s"
-    },
     {
         "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data transmit bandwidth (MB/sec)",
         "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e6 / duration_time",
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json
index f6e3e40a3b20..bf68493d4509 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/frontend.json
@@ -41,7 +41,6 @@
         "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x1",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -53,7 +52,6 @@
         "EventName": "FRONTEND_RETIRED.DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x11",
-        "PEBS": "1",
         "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -65,7 +63,6 @@
         "EventName": "FRONTEND_RETIRED.ITLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x14",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -77,7 +74,6 @@
         "EventName": "FRONTEND_RETIRED.L1I_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x12",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -89,7 +85,6 @@
         "EventName": "FRONTEND_RETIRED.L2_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x13",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -101,7 +96,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600106",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -113,7 +107,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_128",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x608006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -125,7 +118,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_16",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x601006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -137,7 +129,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600206",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -149,7 +140,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_256",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x610006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -161,7 +151,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x100206",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -173,7 +162,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_32",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x602006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -185,7 +173,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_4",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600406",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -197,7 +184,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_512",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x620006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -209,7 +195,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_64",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x604006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -221,7 +206,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_8",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x600806",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -233,7 +217,6 @@
         "EventName": "FRONTEND_RETIRED.MS_FLOWS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x8",
-        "PEBS": "1",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
     },
@@ -244,7 +227,6 @@
         "EventName": "FRONTEND_RETIRED.STLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x15",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -256,7 +238,6 @@
         "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x17",
-        "PEBS": "1",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json
index 2ea19539291b..41d4120d4dae 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/memory.json
@@ -63,7 +63,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x400",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "53",
         "UMask": "0x1"
@@ -76,7 +75,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "1009",
         "UMask": "0x1"
@@ -89,7 +87,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "20011",
         "UMask": "0x1"
@@ -102,7 +99,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "503",
         "UMask": "0x1"
@@ -115,7 +111,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -128,7 +123,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100003",
         "UMask": "0x1"
@@ -141,7 +135,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "101",
         "UMask": "0x1"
@@ -154,7 +147,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "2003",
         "UMask": "0x1"
@@ -167,7 +159,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "50021",
         "UMask": "0x1"
@@ -178,7 +169,6 @@
         "Data_LA": "1",
         "EventCode": "0xcd",
         "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE",
-        "PEBS": "2",
         "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8",
         "SampleAfterValue": "1000003",
         "UMask": "0x2"
@@ -305,17 +295,16 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of times RTM abort was triggered.",
         "SampleAfterValue": "100003",
         "UMask": "0x4"
     },
     {
-        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)",
+        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED_EVENTS",
-        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).",
+        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt).",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
     },
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json
index e1de6c2675c4..9129fb7b7ce4 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/metricgroups.json
@@ -40,6 +40,7 @@
     "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -86,7 +87,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -96,6 +99,7 @@
     "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category",
     "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category",
     "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category",
+    "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category",
     "tma_int_operations_group": "Metrics contributing to tma_int_operations category",
     "tma_issue2P": "Metrics related by the issue $issue2P",
     "tma_issueBM": "Metrics related by the issue $issueBM",
@@ -116,10 +120,13 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
     "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category",
+    "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category",
     "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
     "tma_mem_bandwidth_group": "Metrics contributing to tma_mem_bandwidth category",
     "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
@@ -133,5 +140,6 @@
     "tma_retiring_group": "Metrics contributing to tma_retiring category",
     "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category",
     "tma_store_bound_group": "Metrics contributing to tma_store_bound category",
-    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category"
+    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category",
+    "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category"
 }
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json
index 5d5811f26151..50cacfbbc7cf 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/pipeline.json
@@ -62,7 +62,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all branch instructions retired.",
         "SampleAfterValue": "400009"
     },
@@ -71,7 +70,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11"
@@ -81,7 +79,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts not taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x10"
@@ -91,7 +88,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x1"
@@ -101,7 +97,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.FAR_BRANCH",
-        "PEBS": "1",
         "PublicDescription": "Counts far branch instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x40"
@@ -111,7 +106,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
@@ -121,7 +115,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts both direct and indirect near call instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x2"
@@ -131,7 +124,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_RETURN",
-        "PEBS": "1",
         "PublicDescription": "Counts return instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x8"
@@ -141,7 +133,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x20"
@@ -151,7 +142,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.",
         "SampleAfterValue": "400009"
     },
@@ -160,7 +150,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts mispredicted conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11"
@@ -170,7 +159,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.",
         "SampleAfterValue": "400009",
         "UMask": "0x10"
@@ -180,7 +168,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x1"
@@ -190,7 +177,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts miss-predicted near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
@@ -200,7 +186,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.",
         "SampleAfterValue": "400009",
         "UMask": "0x2"
@@ -210,7 +195,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.",
         "SampleAfterValue": "400009",
         "UMask": "0x20"
@@ -220,7 +204,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.RET",
-        "PEBS": "1",
         "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x8"
@@ -469,7 +452,6 @@
         "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
@@ -479,7 +461,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.ANY_P",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003"
     },
@@ -488,7 +469,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.MACRO_FUSED",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x10"
     },
@@ -497,7 +477,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.NOP",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired NOP or ENDBR32/64 instructions",
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
@@ -506,7 +485,6 @@
         "BriefDescription": "Precise instruction retired with PEBS precise-distribution",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.PREC_DIST",
-        "PEBS": "1",
         "PublicDescription": "A version of INST_RETIRED that allows for a precise distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR++) feature to fix bias in how retired instructions get sampled. Use on Fixed Counter 0.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
@@ -516,7 +494,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.REP_ITERATION",
-        "PEBS": "1",
         "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.",
         "SampleAfterValue": "2000003",
         "UMask": "0x8"
diff --git a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json
index 91013ced74aa..94340dee1c9c 100644
--- a/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json
+++ b/tools/perf/pmu-events/arch/x86/emeraldrapids/uncore-io.json
@@ -79,86 +79,6 @@
         "UMask": "0x27",
         "Unit": "iio_free_running"
     },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "9",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART0_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x30",
-        "Unit": "iio_free_running"
-    },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "10",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART1_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x31",
-        "Unit": "iio_free_running"
-    },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "11",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART2_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x32",
-        "Unit": "iio_free_running"
-    },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "12",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART3_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x33",
-        "Unit": "iio_free_running"
-    },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "13",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART4_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x34",
-        "Unit": "iio_free_running"
-    },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "14",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART5_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x35",
-        "Unit": "iio_free_running"
-    },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "15",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART6_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x36",
-        "Unit": "iio_free_running"
-    },
-    {
-        "BriefDescription": "Free running counter that increments for every 32 bytes of data sent from the IO agent to the SOC",
-        "Counter": "16",
-        "EventCode": "0xff",
-        "EventName": "UNC_IIO_BANDWIDTH_OUT.PART7_FREERUN",
-        "Experimental": "1",
-        "PerPkg": "1",
-        "UMask": "0x37",
-        "Unit": "iio_free_running"
-    },
     {
         "BriefDescription": "IIO Clockticks",
         "Counter": "0,1,2,3",
@@ -201,7 +121,7 @@
         "PerPkg": "1",
         "PortMask": "0x0001",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0",
-        "UMask": "0x7001004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -214,7 +134,7 @@
         "PerPkg": "1",
         "PortMask": "0x0002",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 1",
-        "UMask": "0x7002004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -227,7 +147,7 @@
         "PerPkg": "1",
         "PortMask": "0x0004",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 2",
-        "UMask": "0x7004004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -240,7 +160,7 @@
         "PerPkg": "1",
         "PortMask": "0x0008",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 3",
-        "UMask": "0x7008004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -253,7 +173,7 @@
         "PerPkg": "1",
         "PortMask": "0x0010",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 4",
-        "UMask": "0x7010004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -266,7 +186,7 @@
         "PerPkg": "1",
         "PortMask": "0x0020",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 5",
-        "UMask": "0x7020004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -279,7 +199,7 @@
         "PerPkg": "1",
         "PortMask": "0x0040",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 6",
-        "UMask": "0x7040004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -292,7 +212,7 @@
         "PerPkg": "1",
         "PortMask": "0x0080",
         "PublicDescription": "PCIe Completion Buffer Inserts of completions with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 7",
-        "UMask": "0x7080004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -316,7 +236,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0",
-        "UMask": "0x7000001",
+        "UMask": "0x1",
         "Unit": "IIO"
     },
@@ -329,7 +249,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x4 card is plugged in to slot 1",
-        "UMask": "0x7000002",
+        "UMask": "0x2",
         "Unit": "IIO"
     },
@@ -342,7 +262,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1",
-        "UMask": "0x7000004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -355,7 +275,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x4 card is plugged in to slot 3",
-        "UMask": "0x7000008",
+        "UMask": "0x8",
         "Unit": "IIO"
     },
@@ -368,7 +288,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0",
-        "UMask": "0x7000010",
+        "UMask": "0x10",
         "Unit": "IIO"
     },
@@ -381,7 +301,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x4 card is plugged in to slot 1",
-        "UMask": "0x7000020",
+        "UMask": "0x20",
         "Unit": "IIO"
     },
@@ -394,7 +314,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1",
-        "UMask": "0x7000040",
+        "UMask": "0x40",
         "Unit": "IIO"
     },
@@ -407,7 +327,7 @@
         "PerPkg": "1",
         "PortMask": "0x0000",
         "PublicDescription": "x4 card is plugged in to slot 3",
-        "UMask": "0x7000080",
+        "UMask": "0x80",
         "Unit": "IIO"
     },
@@ -431,7 +351,7 @@
         "PerPkg": "1",
         "PortMask": "0x0001",
         "PublicDescription": "Data requested by the CPU : Core reading from Card's MMIO space : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes. : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 card is plugged in to slot 0",
-        "UMask": "0x7001004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -443,7 +363,7 @@
         "PerPkg": "1",
         "PortMask": "0x0002",
         "PublicDescription": "Data requested by the CPU : Core reading from Card's MMIO space : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 1",
-        "UMask": "0x7002004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -455,7 +375,7 @@
         "PerPkg": "1",
         "PortMask": "0x0004",
         "PublicDescription": "Data requested by the CPU : Core reading from Card's MMIO space : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1",
-        "UMask": "0x7004004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -467,7 +387,7 @@
         "PerPkg": "1",
         "PortMask": "0x0008",
         "PublicDescription": "Data requested by the CPU : Core reading from Card's MMIO space : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 3",
-        "UMask": "0x7008004",
+        "UMask": "0x4",
         "Unit": "IIO"
     },
@@ -479,7 +399,7 @@
         "PerPkg": "1",
         "PortMask": "0x0010",
         "PublicDescription": "Data requested by the CPU : Core reading from Cards MMIO space : Number of DWs (4 bytes) requested by the main die. Includes all requests initiated by the main die, including reads and writes.
= : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 ca= rd is plugged in to slot 0", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -491,7 +411,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x4 card is plugged in to slot 1", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -503,7 +423,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -515,7 +435,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x4 card is plugged in to slot 3", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -662,7 +582,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -675,7 +595,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -688,7 +608,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -701,7 +621,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -714,7 +634,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. 
: x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -727,7 +647,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -740,7 +660,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -753,7 +673,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -766,7 +686,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x16 card plugged in to stack, Or x8 card plugged in= to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -779,7 +699,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -792,7 +712,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugg= ed in to slot 1", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +725,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -818,7 +738,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. 
: x16 card plugged in to stack, Or x8 card plugged in= to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -831,7 +751,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -844,7 +764,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugg= ed in to slot 1", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -857,7 +777,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -974,7 +894,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff", @@ -1082,7 +1001,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff", @@ -1299,7 +1217,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Passing data= to be written : How often different queues (e.g. channel / fc) ask to send= request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1351,7 +1269,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Request Owne= rship : How often different queues (e.g. channel / fc) ask to send request = into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1364,7 +1282,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Writing line= : How often different queues (e.g. channel / fc) ask to send request into = pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1377,7 +1295,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Pass= ing data to be written : How often different queues (e.g. channel / fc) are= allowed to send request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1390,7 +1308,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issu= ing final read or write of line : How often different queues (e.g. channel = / fc) are allowed to send request into pipeline", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1403,7 +1321,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Proc= essing response from IOMMU : How often different queues (e.g. 
channel / fc)= are allowed to send request into pipeline", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1416,7 +1334,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issu= ing to IOMMU : How often different queues (e.g. channel / fc) are allowed t= o send request into pipeline", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1429,7 +1347,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Requ= est Ownership : How often different queues (e.g. channel / fc) are allowed = to send request into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1442,7 +1360,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Writ= ing line : How often different queues (e.g. channel / fc) are allowed to se= nd request into pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1962,7 +1880,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : PCIe Request complete : = Only for posted requests : Each PCIe request is broken down into a series o= f cacheline granular requests and each cacheline size request may need to m= ake multiple passes through the pipeline (e.g. for posted interrupts or mul= ti-cast). Each time a single PCIe request completes all its cacheline gra= nular requests, it advances pointer.", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1975,7 +1893,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Writing line : Only for = posted requests : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1988,7 +1906,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Issuing final read or wr= ite of line : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2001,7 +1919,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Passing data to be writt= en : Only for posted requests : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -2014,7 +1932,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Passing dat= a to be written : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -2026,7 +1944,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2039,7 +1957,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Request Own= ership : Only for posted requests", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -2052,7 +1970,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Writing lin= e : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2065,7 +1983,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Passing data = to be written : Each PCIe request is broken down into a series of cacheline= granular requests and each cacheline size request may need to make multipl= e passes through the pipeline (e.g. for posted interrupts or multi-cast). = Each time a cacheline completes a single pass (e.g. 
posts a write to singl= e multi-cast target) it advances state : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -2091,7 +2009,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Request Owner= ship : Each PCIe request is broken down into a series of cacheline granular= requests and each cacheline size request may need to make multiple passes = through the pipeline (e.g. for posted interrupts or multi-cast). Each tim= e a cacheline completes a single pass (e.g. posts a write to single multi-c= ast target) it advances state : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2104,7 +2022,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Writing line = : Each PCIe request is broken down into a series of cacheline granular requ= ests and each cacheline size request may need to make multiple passes throu= gh the pipeline (e.g. for posted interrupts or multi-cast). Each time a c= acheline completes a single pass (e.g. posts a write to single multi-cast t= arget) it advances state : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -2309,7 +2227,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane= 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2322,7 +2240,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2335,7 +2253,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 2= ", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2348,7 +2266,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2361,7 +2279,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x16 card plugged in to Lane 4/5/6/7, Or x8 card plugged in to Lane= 4/5, Or x4 card is plugged in to slot 4", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2374,7 +2292,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. 
: Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 5",
-        "UMask": "0x7020002",
+        "UMask": "0x2",
         "Unit": "IIO"
     },
     {
@@ -2387,7 +2305,7 @@
         "PerPkg": "1",
         "PortMask": "0x0040",
         "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x8 card plugged in to Lane 6/7, Or x4 card is plugged in to slot 6",
-        "UMask": "0x7040002",
+        "UMask": "0x2",
         "Unit": "IIO"
     },
     {
@@ -2400,7 +2318,7 @@
         "PerPkg": "1",
         "PortMask": "0x0080",
         "PublicDescription": "Number Transactions requested by the CPU : Another card (different IIO stack) writing to this card. : Also known as Outbound. Number of requests initiated by the main die, including reads and writes. : x4 card is plugged in to slot 7",
-        "UMask": "0x7080002",
+        "UMask": "0x2",
         "Unit": "IIO"
     },
     {
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 88919998da3d..bede1b1e2c3b 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -9,7 +9,7 @@ GenuineIntel-6-4F,v23,broadwellx,core
 GenuineIntel-6-55-[56789ABCDEF],v1.23,cascadelakex,core
 GenuineIntel-6-DD,v1.00,clearwaterforest,core
 GenuineIntel-6-9[6C],v1.05,elkhartlake,core
-GenuineIntel-6-CF,v1.09,emeraldrapids,core
+GenuineIntel-6-CF,v1.11,emeraldrapids,core
 GenuineIntel-6-5[CF],v13,goldmont,core
 GenuineIntel-6-7A,v1.01,goldmontplus,core
 GenuineIntel-6-B6,v1.03,grandridge,core
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:42 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-11-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
X-Mailer: git-send-email 2.48.0.rc2.279.g1de40edade-goog
Subject: [PATCH v2 10/23] perf vendor events: Update GrandRidge events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.03 to v1.05. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.05:
https://github.com/intel/perfmon/commit/3b2e3528fbfb5576f443607ac9d772de88aed72c
https://github.com/intel/perfmon/commit/9bc1815536ff1f6fe73693a19a410b6a711740c2
The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782
Update uncore IIO events umask with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:

Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/grandridge/grr-metrics.json      | 284 ++++++------
 .../arch/x86/grandridge/pipeline.json         |   3 +-
 .../arch/x86/grandridge/uncore-cache.json     |   4 +-
 .../x86/grandridge/uncore-interconnect.json   |  60 ++++
 .../arch/x86/grandridge/uncore-io.json        | 214 ++++++------
 .../arch/x86/grandridge/uncore-memory.json    |   2 +-
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 7 files changed, 260 insertions(+), 309 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json b/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json
index 07e542297e93..2f9959c61718 100644
--- a/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/grandridge/grr-metrics.json
@@ -1,4 +1,11 @@
 [
+    {
+        "BriefDescription": "C10 residency percent per package",
+        "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C10_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "C1 residency percent per core",
         "MetricExpr": "cstate_core@c1\\-residency@ / TSC",
@@ -7,17 +14,24 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "C6 residency percent per core",
-        "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
+        "BriefDescription": "C2 residency percent per package",
+        "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC",
         "MetricGroup": "Power",
-        "MetricName": "C6_Core_Residency",
+        "MetricName": "C2_Pkg_Residency",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "C6 residency percent per module",
-        "MetricExpr": "cstate_module@c6\\-residency@ / TSC",
+        "BriefDescription": "C3 residency percent per package",
+        "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC",
         "MetricGroup": "Power",
-        "MetricName": "C6_Module_Residency",
+        "MetricName": "C3_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C6 residency percent per core",
+        "MetricExpr": "cstate_core@c6\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C6_Core_Residency",
         "ScaleUnit": "100%"
     },
     {
@@ -28,7 +42,21 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.",
+        "BriefDescription": "C7 residency percent per core",
+        "MetricExpr": "cstate_core@c7\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C7_Core_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "C8 residency percent per package",
+        "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC",
+        "MetricGroup": "Power",
+        "MetricName": "C8_Pkg_Residency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY",
         "MetricName": "cpi",
"ScaleUnit": "1per_instr" @@ -49,31 +77,31 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. 
This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / durati= on_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" @@ -82,14 +110,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -188,17 +216,15 @@ { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", "MetricExpr": "tma_core_bound", - "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend due to backend stalls", "MetricExpr": "TOPDOWN_BE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%" @@ -206,104 +232,92 @@ { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", "MetricExpr": "TOPDOWN_BAD_SPECULATION.ALL_P / (6 * CPU_CLK_UNHALT= ED.CORE)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", + "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). 
Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", - "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", + "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to branch mispredicts", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS).", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS)", "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (6 * CPU_CLK_UNHALTED.CORE)= ", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { 
"BriefDescription": "Counts the number of cycles due to backend bo= und stalls that are bounded by core restrictions and not attributed to an o= utstanding load or stores, or resource limitation", "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls", "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (6 * CPU_CLK_UNH= ALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls.", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls", "MetricExpr": "TOPDOWN_FE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses", "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": 
"TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (6 * CPU_CLK_UN= HALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -331,32 +345,6 @@ "MetricGroup": "Flops", "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp" }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", - "MetricExpr": "tma_info_bottleneck_dtlb_miss_bound_cycles", - "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", - "MetricExpr": "tma_info_bottleneck_ifetch_miss_bound_cycles", - "MetricGroup": "Ifetch", - "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", - "MetricExpr": "tma_info_bottleneck_load_miss_bound_cycles", - "MetricGroup": "Load_Store_Miss", - "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", - "MetricExpr": "tma_info_bottleneck_mem_exec_bound_cycles", - "MetricGroup": "Mem_Exec", - "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. 
See Info.Mem_Exec_Bound" - }, { "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", "MetricExpr": "100 * (LD_HEAD.DTLB_MISS_AT_RET + LD_HEAD.PGWALK_AT= _RET) / CPU_CLK_UNHALTED.CORE", @@ -438,21 +426,6 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio" }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "tma_info_buffer_stalls_load_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "tma_info_buffer_stalls_mem_rsv_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "tma_info_buffer_stalls_store_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles" - }, { "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", "MetricExpr": "100 * MEM_SCHEDULER_BLOCK.LD_BUF / CPU_CLK_UNHALTED= .CORE", @@ -474,7 +447,8 @@ { "BriefDescription": "Cycles Per Instruction", "MetricExpr": "CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY", - "MetricName": "tma_info_core_cpi" + "MetricName": "tma_info_core_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Floating Point Operations Per Cycle", @@ -492,21 +466,6 @@ "MetricExpr": "TOPDOWN_RETIRING.ALL_P / INST_RETIRED.ANY", "MetricName": "tma_info_core_upi" }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l2h= it", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3h= it", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.LLC_MISS / MEM_BOUND_= STALLS_IFETCH.ALL", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss" - }, { "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.L2_HIT / MEM_BOUND_ST= ALLS_IFETCH.ALL", @@ -519,24 +478,6 @@ "MetricName": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3h= it", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l2hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l3hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled 
due to an L1 miss that subsequently misses the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.LLC_MISS / MEM_BOUND_ST= ALLS_LOAD.ALL", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s" - }, { "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.L2_HIT / MEM_BOUND_STAL= LS_LOAD.ALL", @@ -584,16 +525,6 @@ "MetricExpr": "1e3 * MACHINE_CLEARS.SMC / INST_RETIRED.ANY", "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki" }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_adressaliasing", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g" - }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_storefwdblk", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk" - }, { "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", "MetricExpr": "100 * LD_BLOCKS.ADDRESS_ALIAS / MEM_UOPS_RETIRED.AL= L_LOADS", @@ -606,31 +537,6 @@ "MetricName": "tma_info_mem_exec_blocks_loads_with_storefwdblk", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_l1miss", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_otherpipeline= blks", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_pagewalk", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_stlbhit", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_storefwding", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding" - }, { "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", "MetricExpr": "100 * LD_HEAD.L1_MISS_AT_RET / LD_HEAD.ANY_AT_RET", @@ -686,11 +592,6 @@ "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / TOPDOWN_RETIRING= .ALL_P", "MetricName": "tma_info_mem_mix_memload_ratio" }, - { - "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", - "MetricExpr": "tma_info_serialization_tpause_cycles", - "MetricName": "tma_info_serialization _%_tpause_cycles" - }, { "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", "MetricExpr": "100 * SERIALIZATION.C01_MS_SCB / (6 * CPU_CLK_UNHAL= TED.CORE)", @@ -711,14 +612,17 @@ }, { "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu@CPU_CLK_UNHALTED.CORE_P@k / CPU_CLK_UNHALTED.CO= RE", - 
"MetricGroup": "Summary", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P:k / CPU_CLK_UNHALTED.CORE", "MetricName": "tma_info_system_kernel_utilization" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", + "MetricName": "tma_info_system_mux" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", "MetricName": "tma_info_system_turbo_utilization" }, { @@ -742,102 +646,90 @@ "MetricName": "tma_info_uop_mix_x87_uop_ratio" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB= ) misses.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB= ) misses", "MetricExpr": "TOPDOWN_FE_BOUND.ITLB_MISS / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (6 * CPU_C= LK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", "MetricName": "tma_machine_clears", - "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (6 * CPU_CLK_UNHALTE= D.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": 
"Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized", "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes", "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (6 * CPU_CLK_UNHALTED.C= ORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", "MetricExpr": "tma_backend_bound - tma_core_bound", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that result = in retirement slots", "MetricExpr": "TOPDOWN_RETIRING.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by 
     {
         "BriefDescription": "Counts the number of issue slots that result in retirement slots",
         "MetricExpr": "TOPDOWN_RETIRING.ALL_P / (6 * CPU_CLK_UNHALTED.CORE)",
-        "MetricGroup": "TopdownL1;tma_L1_group",
+        "MetricGroup": "Slots;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
-        "MetricThreshold": "tma_retiring > 0.75",
         "MetricgroupNoGroup": "TopdownL1",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS)",
         "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (6 * CPU_CLK_UNHALTED.CORE)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_group",
         "MetricName": "tma_serialization",
-        "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
         "ScaleUnit": "100%"
     },
     {
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json b/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json
index b67c0c89054d..40fa4f5ae261 100644
--- a/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/grandridge/pipeline.json
@@ -1,6 +1,6 @@
 [
     {
-        "BriefDescription": "Counts the number of cycles when any of the dividers are active.",
+        "BriefDescription": "Counts the number of cycles when any of the floating point or integer dividers are active.",
         "Counter": "0,1,2,3,4,5,6,7",
         "CounterMask": "1",
         "EventCode": "0xcd",
@@ -185,7 +185,6 @@
         "BriefDescription": "Fixed Counter: Counts the number of instructions retired",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
     },
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json b/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json
index 1eaf796601b1..6a80cf6cbd36 100644
--- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json
+++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-cache.json
@@ -573,7 +573,7 @@
         "Unit": "CHA"
     },
     {
-        "BriefDescription": "Code read from local IA that miss the cache",
+        "BriefDescription": "Code read from local IA",
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_CRD",
@@ -593,7 +593,7 @@
         "Unit": "CHA"
     },
     {
-        "BriefDescription": "Data read opt from local IA that miss the cache",
+        "BriefDescription": "Data read opt from local IA",
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_DRD_OPT",
diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json
index 6aaca4039107..2c18767511f3 100644
--- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-interconnect.json
@@ -183,6 +183,26 @@
         "PerPkg": "1",
         "Unit": "IRP"
     },
+    {
+        "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Rejects",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x1E",
+        "EventName": "UNC_I_MISC0.FAST_REJ",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "UMask": "0x2",
+        "Unit": "IRP"
+    },
+    {
+        "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Requests",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x1E",
+        "EventName": "UNC_I_MISC0.FAST_REQ",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "UMask": "0x1",
+        "Unit": "IRP"
+    },
     {
         "BriefDescription": "Misc Events - Set 1 : Lost Forward : Snoop pulled away ownership before a write was committed",
         "Counter": "0,1,2,3",
@@ -193,6 +213,46 @@
         "UMask": "0x10",
         "Unit": "IRP"
     },
+    {
+        "BriefDescription": "Snoop Hit E/S responses",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x12",
+        "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_ES",
+        "Experimental": "1",
+        "PerPkg": "1",
+        "UMask": "0x74",
+        "Unit": "IRP"
+    },
+    {
+        "BriefDescription": "Snoop Hit I responses",
+        "Counter": "0,1,2,3",
+        "EventCode": "0x12",
+        "EventName":
"UNC_I_SNOOP_RESP.ALL_HIT_I", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x72", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit M responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_M", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x78", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop miss responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_MISS", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x71", + "Unit": "IRP" + }, { "BriefDescription": "Inbound write (fast path) requests to coheren= t memory, received by the IRP resulting in write ownership requests issued = by IRP to the mesh.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json b/too= ls/perf/pmu-events/arch/x86/grandridge/uncore-io.json index 33fc7b835abf..c5b05c71c56d 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-io.json @@ -17,7 +17,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -29,7 +29,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -41,7 +41,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -53,7 +53,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -65,7 +65,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -77,7 +77,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -89,7 +89,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -101,7 +101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -113,7 +113,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -125,7 +125,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff0ff", + "UMask": "0xff", "Unit": "IIO" }, { @@ -137,7 +137,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -149,7 +149,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -161,7 +161,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -173,7 +173,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -185,7 +185,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -197,7 +197,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -209,7 +209,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -221,7 +221,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -232,7 +232,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": 
"0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -244,7 +244,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -256,7 +256,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -268,7 +268,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -280,7 +280,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -292,7 +292,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -304,7 +304,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -316,7 +316,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -328,7 +328,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -339,7 +339,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -350,7 +350,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -361,7 +361,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -372,7 +372,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -383,7 +383,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -394,7 +394,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -405,7 +405,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -416,7 +416,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -427,7 +427,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -438,7 +438,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -449,7 +449,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -460,7 +460,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -471,7 +471,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -482,7 +482,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -493,7 +493,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -504,7 +504,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -515,7 +515,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -526,7 +526,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { 
@@ -537,7 +537,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -548,7 +548,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -559,7 +559,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -570,7 +570,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -581,7 +581,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -592,7 +592,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -603,7 +603,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -615,7 +615,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -627,7 +627,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -639,7 +639,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -651,7 +651,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -663,7 +663,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -675,7 +675,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -687,7 +687,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -699,7 +699,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -832,7 +832,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -844,7 +844,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -856,7 +856,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -868,7 +868,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -880,7 +880,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -892,7 +892,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -904,7 +904,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -925,7 +925,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -936,7 +936,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -947,7 +947,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -958,7 +958,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -969,7 +969,7 @@ "FCMask": "0x07", "PerPkg": 
"1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -980,7 +980,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -991,7 +991,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1002,7 +1002,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1013,7 +1013,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1024,7 +1024,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1035,7 +1035,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1046,7 +1046,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1057,7 +1057,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1068,7 +1068,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1079,7 +1079,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1090,7 +1090,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1101,7 +1101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1112,7 +1112,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1123,7 +1123,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1134,7 +1134,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1145,7 +1145,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1156,7 +1156,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1167,7 +1167,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1178,7 +1178,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1189,7 +1189,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1200,7 +1200,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1211,7 +1211,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1222,7 +1222,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1233,7 +1233,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1244,7 +1244,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1255,7 +1255,7 @@ "FCMask": "0x07", 
"PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1266,7 +1266,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1278,7 +1278,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1290,7 +1290,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1302,7 +1302,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1314,7 +1314,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1326,7 +1326,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1338,7 +1338,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1350,7 +1350,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1362,7 +1362,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" } ] diff --git a/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json b= /tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json index 7e6e6764f181..e75b3050ccd5 100644 --- a/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/grandridge/uncore-memory.json @@ -169,7 +169,7 @@ "Unit": "IMC" }, { - "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled", + "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled. 
DCLK is 1/4 of DRAM data rate.",
         "Counter": "0,1,2,3",
         "EventCode": "0x01",
         "EventName": "UNC_M_CLOCKTICKS",
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index bede1b1e2c3b..99b3101c4e4e 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -12,7 +12,7 @@ GenuineIntel-6-9[6C],v1.05,elkhartlake,core
 GenuineIntel-6-CF,v1.11,emeraldrapids,core
 GenuineIntel-6-5[CF],v13,goldmont,core
 GenuineIntel-6-7A,v1.01,goldmontplus,core
-GenuineIntel-6-B6,v1.03,grandridge,core
+GenuineIntel-6-B6,v1.05,grandridge,core
 GenuineIntel-6-A[DE],v1.02,graniterapids,core
 GenuineIntel-6-(3C|45|46),v35,haswell,core
 GenuineIntel-6-3F,v28,haswellx,core
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
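For context on the mapfile.csv hunk above: perf selects an event directory by matching the CPU's CPUID string against the regular expression in the first column, so this row moves model 0xB6 parts from the v1.03 to the v1.05 grandridge tables. A simplified sketch of that lookup (not perf's actual implementation; rows abbreviated from the hunk):

import re

MAPFILE_ROWS = [
    # cpuid regex, version, directory, pmu type
    ("GenuineIntel-6-B6", "v1.05", "grandridge", "core"),
    ("GenuineIntel-6-A[DE]", "v1.02", "graniterapids", "core"),
]

def lookup(cpuid: str):
    """Return (directory, version) for the first regex matching the CPUID."""
    for pattern, version, directory, _pmu in MAPFILE_ROWS:
        if re.fullmatch(pattern, cpuid):
            return directory, version
    return None

print(lookup("GenuineIntel-6-B6"))  # -> ('grandridge', 'v1.05')
print(lookup("GenuineIntel-6-AD"))  # -> ('graniterapids', 'v1.02')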
Date: Wed, 15 Jan 2025 22:43:43 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-12-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 11/23] perf vendor events: Update/add Graniterapids events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "Andreas Färber" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan

Update events from v1.02 to v1.04. Add TMA metrics 5.01.
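One way to see what a bump like this adds is to diff the MetricName sets of the old and new JSON. A hedged sketch, assuming hypothetical local before/after copies of the metrics file:

import json

def metric_names(path: str) -> set:
    """Collect MetricName values from a pmu-events JSON file."""
    with open(path) as f:
        return {m["MetricName"] for m in json.load(f) if "MetricName" in m}

# File names are placeholders for locally saved before/after snapshots.
added = metric_names("metrics-new.json") - metric_names("metrics-old.json")
print(sorted(added))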
Bring in the event updates v1.04:
https://github.com/intel/perfmon/commit/79b9e512eab58641941a0b8d10ffe75914a87e17
https://github.com/intel/perfmon/commit/bc74a895e461b5ac720559da667e83a8fedf7829
The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782
Update uncore IIO events umask with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:

Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/graniterapids/cache.json         |  110 +-
 .../arch/x86/graniterapids/counter.json       |    2 +-
 .../arch/x86/graniterapids/frontend.json      |   24 +-
 .../arch/x86/graniterapids/gnr-metrics.json   | 2311 +++++++++++++++++
 .../arch/x86/graniterapids/memory.json        |  121 +-
 .../arch/x86/graniterapids/metricgroups.json  |  145 ++
 .../arch/x86/graniterapids/other.json         |   99 +
 .../arch/x86/graniterapids/pipeline.json      |   40 +-
 .../arch/x86/graniterapids/uncore-cache.json  |   14 +-
 .../graniterapids/uncore-interconnect.json    |   87 +
 .../arch/x86/graniterapids/uncore-io.json     |  280 +-
 .../arch/x86/graniterapids/uncore-memory.json |  122 +-
 .../arch/x86/graniterapids/uncore-power.json  |   98 +
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 14 files changed, 3212 insertions(+), 243 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json
 create mode 100644 tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json

diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/cache.json b/tools/perf/pmu-events/arch/x86/graniterapids/cache.json
index b56066274813..37cd575eee9e 100644
--- a/tools/perf/pmu-events/arch/x86/graniterapids/cache.json
+++ b/tools/perf/pmu-events/arch/x86/graniterapids/cache.json
@@ -83,11 +83,11 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0x26",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
@@ -320,7 +320,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions.
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -331,7 +330,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -342,7 +340,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -353,7 +350,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -364,7 +360,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -375,7 +370,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -386,7 +380,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x9" @@ -397,7 +390,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0xa" @@ -408,7 +400,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -419,7 +410,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -439,7 +429,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -450,7 +439,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -461,7 +449,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -472,7 +459,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources 
were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -483,7 +469,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -494,7 +479,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x2" }, @@ -504,7 +488,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was forwarded from a remote cache.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -515,7 +498,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -525,7 +507,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -536,7 +517,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -547,7 +527,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. 
This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -558,7 +537,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -569,7 +547,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -580,7 +557,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -591,7 +567,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -602,7 +577,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -664,6 +638,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that resulted in a s= noop that hit in another core, which did not forward the data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_NO_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x4003C0001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data reads that resulted in a s= noop hit in another core's caches which forwarded the unmodified data to th= e requesting core.", "Counter": "0,1,2,3", @@ -674,6 +658,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y a cache on a remote socket where a snoop hit a modified line in another c= ore's caches which forwarded the data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_CACHE.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1030000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y a cache on a remote socket where a snoop hit in another core's caches whi= ch forwarded the unmodified data to the requesting core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_CACHE.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x830000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) reque= sts and software prefetches for exclusive ownership (PREFETCHW) that hit in= the L3 or were snooped from another core's caches on the same socket.", "Counter": "0,1,2,3", @@ -704,6 +708,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that resulted in a snoop hit a modified line in another core's caches w= hich forwarded the 
data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10003C4477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop was sent= and data was returned (Modified or Not Modified).", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_CACHE.SNOOP_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1830004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop hit a mo= dified line in another core's caches which forwarded the data.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_CACHE.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1030004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by a cache on a remote socket where a snoop hit in a= nother core's caches which forwarded the unmodified data to the requesting = core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_CACHE.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x830004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO), hard= ware prefetch RFOs (which bring data to L2), and software prefetches for ex= clusive ownership (PREFETCHW) that hit to a (M)odified cacheline in the L3 = or snoop filter.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.RFO_TO_CORE.L3_HIT_M", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1F80040022", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Any memory transaction that reached the SQ.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/counter.json b/to= ols/perf/pmu-events/arch/x86/graniterapids/counter.json index 250781a8ca64..4e95dd19aa19 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/counter.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/counter.json @@ -52,7 +52,7 @@ { "Unit": "PCU", "CountersNumFixed": "0", - "CountersNumGeneric": 4 + "CountersNumGeneric": "4" }, { "Unit": "CHACMS", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json b/t= ools/perf/pmu-events/arch/x86/graniterapids/frontend.json index 663c1a0e55a2..dc81055941b1 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/frontend.json @@ -41,7 +41,6 @@ "EventName": "FRONTEND_RETIRED.ANY_ANT", "MSRIndex": "0x3F7", "MSRValue": "0x9", - "PEBS": "1", "PublicDescription": "Always Not Taken (ANT) conditional retired b= ranches (no BTB entry and not mispredicted)", "SampleAfterValue": "100007", "UMask": "0x3" @@ -53,7 +52,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode 
stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -65,7 +63,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -77,7 +74,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -89,7 +85,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -101,7 +96,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -113,7 +107,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x600106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -125,7 +118,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x608006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -137,7 +129,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x601006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -149,7 +140,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x600206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -161,7 +151,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x610006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -173,7 +162,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. 
A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -185,7 +173,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x602006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -197,7 +184,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x600406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -209,7 +195,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x620006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -221,7 +206,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x604006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -233,7 +217,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x600806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. 
During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -244,8 +227,7 @@ "EventCode": "0xc6", "EventName": "FRONTEND_RETIRED.LATE_SWPF", "MSRIndex": "0x3F7", - "MSRValue": "0x9", - "PEBS": "1", + "MSRValue": "0xA", "PublicDescription": "Number of Instruction Cache demand miss in s= hadow of an on-going i-fetch cache-line triggered by PREFETCHIT0/1 instruct= ions", "SampleAfterValue": "100007", "UMask": "0x3" @@ -257,7 +239,6 @@ "EventName": "FRONTEND_RETIRED.MISP_ANT", "MSRIndex": "0x3F7", "MSRValue": "0x9", - "PEBS": "1", "PublicDescription": "ANT retired branches that got just mispredic= ted", "SampleAfterValue": "100007", "UMask": "0x2" @@ -269,7 +250,6 @@ "EventName": "FRONTEND_RETIRED.MS_FLOWS", "MSRIndex": "0x3F7", "MSRValue": "0x8", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x3" }, @@ -280,7 +260,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x3" @@ -292,7 +271,6 @@ "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", "MSRIndex": "0x3F7", "MSRValue": "0x17", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x3" }, diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json = b/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json new file mode 100644 index 000000000000..2cccb7070d44 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/gnr-metrics.json @@ -0,0 +1,2311 @@ +[ + { + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ" + }, + { + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", + "MetricName": "cpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Percent of time that cores are in cstate C0 a= s observed by the power control unit (PCU)", + "MetricExpr": "100 * (UNC_P_POWER_STATE_OCCUPANCY_CORES_C0 / (pcu_= 0@UNC_P_CLOCKTICKS@ / #num_packages) / (#num_cores / #num_packages * #num_p= ackages))", + "MetricName": "cpu_cstate_c0" + }, + { + "BriefDescription": "Percent of time that cores are in cstate C6 a= s observed by the power control unit (PCU)", + "MetricExpr": "100 * (UNC_P_POWER_STATE_OCCUPANCY_CORES_C6 / (pcu_= 0@UNC_P_CLOCKTICKS@ / #num_packages) / (#num_cores / #num_packages * #num_p= ackages))", + "MetricName": "cpu_cstate_c6" + }, + { + "BriefDescription": "CPU operating frequency (in 
GHz)", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD / CPU_CLK_UNHALTED.REF_TSC = * #SYSTEM_TSC_FREQ / 1e9", + "MetricName": "cpu_operating_frequency", + "ScaleUnit": "1GHz" + }, + { + "BriefDescription": "Percentage of time spent in the active CPU po= wer state C0", + "MetricExpr": "tma_info_system_cpus_utilized", + "MetricName": "cpu_utilization", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricName": "dtlb_2nd_level_load_mpi", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", + "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricName": "dtlb_2nd_level_store_mpi", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO reads that are initiated by end device controlle= rs that are requesting memory from the CPU", + "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e= 6 / duration_time", + "MetricName": "iio_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO writes that are initiated by end device controll= ers that are writing memory to the CPU", + "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1= e6 / duration_time", + "MetricName": "iio_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / durati= on_time", + "MetricName": "io_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / = duration_time", + "MetricName": "io_bandwidth_read_local", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 /= duration_time", + "MetricName": "io_bandwidth_read_remote", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket", + "MetricExpr": 
"(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_IN= SERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_write_local", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_I= NSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_write_remote", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", + "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", + "MetricName": "itlb_2nd_level_mpi", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of code read requests missing= in L1 instruction cache (includes prefetches) to the total number of compl= eted instructions", + "MetricExpr": "L2_RQSTS.ALL_CODE_RD / INST_RETIRED.ANY", + "MetricName": "l1_i_code_read_misses_with_prefetches_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of demand load requests hitti= ng in L1 data cache to the total number of completed instructions", + "MetricExpr": "MEM_LOAD_RETIRED.L1_HIT / INST_RETIRED.ANY", + "MetricName": "l1d_demand_data_read_hits_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of requests missing L1 data c= ache (includes data+rfo w/ prefetches) to the total number of completed ins= tructions", + "MetricExpr": "L1D.REPLACEMENT / INST_RETIRED.ANY", + "MetricName": "l1d_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of code read request missing = L2 cache to the total number of completed instructions", + "MetricExpr": "L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricName": "l2_demand_code_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of completed demand load requ= ests hitting in L2 cache to the total number of completed instructions", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT / INST_RETIRED.ANY", + "MetricName": "l2_demand_data_read_hits_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of completed data read reques= t missing L2 cache to the total number of completed instructions", + "MetricExpr": "MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricName": "l2_demand_data_read_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of requests missing L2 cache = (includes code+data+rfo w/ prefetches) to the total number of completed ins= tructions", + "MetricExpr": "L2_LINES_IN.ALL / INST_RETIRED.ANY", + "MetricName": "l2_mpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of code read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD / INST_RETIRED.ANY", + "MetricName": "llc_code_read_mpi_demand_plus_prefetch", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Ratio of number of data read requests missing= last level core cache (includes demand w/ prefetches) 
+    {
+        "BriefDescription": "Ratio of number of data read requests missing last level core cache (includes demand w/ prefetches) to the total number of completed instructions",
+        "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA + UNC_CHA_TOR_INSERTS.IA_MISS_DRD + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF) / INST_RETIRED.ANY",
+        "MetricName": "llc_data_read_mpi_demand_plus_prefetch",
+        "ScaleUnit": "1per_instr"
+    },
+    {
+        "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) in nano seconds",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_INSERTS.IA_MISS_DRD) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD) * #num_packages)) * duration_time",
+        "MetricName": "llc_demand_data_read_miss_latency",
+        "ScaleUnit": "1ns"
+    },
+    {
+        "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to local memory in nano seconds",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_LOCAL) * #num_packages)) * duration_time",
+        "MetricName": "llc_demand_data_read_miss_latency_for_local_requests",
+        "ScaleUnit": "1ns"
+    },
+    {
+        "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to remote memory in nano seconds",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_REMOTE) * #num_packages)) * duration_time",
+        "MetricName": "llc_demand_data_read_miss_latency_for_remote_requests",
+        "ScaleUnit": "1ns"
+    },
+    {
+        "BriefDescription": "Average latency of a last level cache (LLC) demand data read miss (read memory access) addressed to DRAM in nano seconds",
+        "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR) * #num_packages)) * duration_time",
+        "MetricName": "llc_demand_data_read_miss_to_dram_latency",
+        "ScaleUnit": "1ns"
+    },
+    {
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to local memory",
+        "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_time",
+        "MetricName": "llc_miss_local_memory_bandwidth_read",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to local memory",
+        "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration_time",
+        "MetricName": "llc_miss_local_memory_bandwidth_write",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "Bandwidth (MB/sec) of read requests that miss the last level cache (LLC) and go to remote memory",
+        "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration_time",
+        "MetricName": "llc_miss_remote_memory_bandwidth_read",
+        "ScaleUnit": "1MB/s"
+    },
+    {
+        "BriefDescription": "Bandwidth (MB/sec) of write requests that miss the last level cache (LLC) and go to remote memory",
+        "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duration_time",
+        "MetricName": "llc_miss_remote_memory_bandwidth_write",
+        "ScaleUnit": "1MB/s"
+    },
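
The llc_demand_data_read_miss_latency expressions above are an application of Little's law: average TOR occupancy divided by the insert (arrival) count gives cycles in flight per miss, and the CHA clock converts cycles to seconds. A hedged Python sketch of that reading (names are illustrative; num_cha_units plays the role of source_count(...) * #num_packages):

  # Hedged sketch: average LLC miss latency via Little's law.
  def llc_miss_latency_ns(tor_occupancy, tor_inserts,
                          cha_clockticks_total, num_cha_units, seconds):
      cycles_per_miss = tor_occupancy / tor_inserts        # avg in-flight cycles
      cha_hz = cha_clockticks_total / num_cha_units / seconds  # per-CHA clock rate
      return cycles_per_miss / cha_hz * 1e9
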
"ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "DDR memory read bandwidth (MB/sec)", + "MetricExpr": "(UNC_M_CAS_COUNT_SCH0.RD + UNC_M_CAS_COUNT_SCH1.RD)= * 64 / 1e6 / duration_time", + "MetricName": "memory_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "DDR memory bandwidth (MB/sec)", + "MetricExpr": "(UNC_M_CAS_COUNT_SCH0.RD + UNC_M_CAS_COUNT_SCH1.RD = + UNC_M_CAS_COUNT_SCH0.WR + UNC_M_CAS_COUNT_SCH1.WR) * 64 / 1e6 / duration_= time", + "MetricName": "memory_bandwidth_total", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "DDR memory write bandwidth (MB/sec)", + "MetricExpr": "(UNC_M_CAS_COUNT_SCH0.WR + UNC_M_CAS_COUNT_SCH1.WR)= * 64 / 1e6 / duration_time", + "MetricName": "memory_bandwidth_write", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL = + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", + "MetricName": "numa_reads_addressed_to_local_dram", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", + "MetricName": "numa_reads_addressed_to_remote_dram", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uops delivered from decoded instruction cache= (decoded stream buffer or DSB) as a percent of total uops delivered to Ins= truction Decode Queue", + "MetricExpr": "IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.= MS_UOPS + LSD.UOPS)", + "MetricName": "percent_uops_delivered_from_decoded_icache", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uops delivered from legacy decode pipeline (M= icro-instruction Translation Engine or MITE) as a percent of total uops del= ivered to Instruction Decode Queue", + "MetricExpr": "IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ= .MS_UOPS + LSD.UOPS)", + "MetricName": "percent_uops_delivered_from_legacy_decode_pipeline", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uops delivered from microcode sequencer (MS) = as a percent of total uops delivered to Instruction Decode Queue", + "MetricExpr": "IDQ.MS_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.M= S_UOPS + LSD.UOPS)", + "MetricName": "percent_uops_delivered_from_microcode_sequencer", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" + }, + { + "BriefDescription": "The ratio of number of completed memory store= instructions to the total number completed instructions", + "MetricExpr": 
"MEM_INST_RETIRED.ALL_STORES / INST_RETIRED.ANY", + "MetricName": "stores_per_instr", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the Advanced Matrix eXtensions (AMX) execution engine was busy with tile = (arithmetic) operations", + "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", + "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", + "MetricName": "tma_amx_busy", + "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
+    {
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_backend_bound",
+        "MetricThreshold": "tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bad_speculation",
+        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This includes slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
+        "ScaleUnit": "100%"
+    },
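
Note how the four level-1 top-down categories are read from the topdown slot counters, with tma_bad_speculation derived as the clamped residual of the other three rather than read directly. A hedged Python sketch, ignoring the INT_MISC.UOP_DROPPING correction that tma_frontend_bound additionally applies:

  # Hedged sketch: level-1 top-down fractions from the four topdown counters.
  def topdown_level1(fe_bound, bad_spec, retiring, be_bound):
      slots = fe_bound + bad_spec + retiring + be_bound
      frontend = fe_bound / slots
      backend = be_bound / slots
      retire = retiring / slots
      # tma_bad_speculation: residual of the other three, clamped at zero.
      bad_speculation = max(1.0 - (frontend + backend + retire), 0.0)
      return frontend, bad_speculation, backend, retire
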
(A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. 
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
+    },
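
Despite their size, the tma_bottleneck_* expressions all follow one rollup: walk down the TMA tree, weight each node by its share of its siblings, and sum the leaves charged to the bottleneck, scaled to percent (tma_bottleneck_big_code earlier is the simplest instance). A hedged Python sketch of that rollup step:

  # Hedged sketch: the weighted-tree rollup behind the tma_bottleneck_* metrics.
  def bottleneck_pct(parent_fraction, siblings, selected):
      # siblings: {leaf_name: fraction}; selected: leaves charged to this bottleneck.
      share = sum(siblings[name] for name in selected) / sum(siblings.values())
      return 100.0 * parent_fraction * share

  # tma_bottleneck_big_code is roughly bottleneck_pct(tma_fetch_latency,
  # {all six fetch-latency children}, ["tma_itlb_misses", "tma_icache_misses",
  # "tma_unknown_branches"]).
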
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
+        "ScaleUnit": "100%"
+    },
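
The MetricThreshold strings chain a node's own threshold with its ancestors' (tma_cisc, for instance, is only flagged when tma_microcode_sequencer and tma_heavy_operations are also elevated). A hedged Python sketch of how a consumer might evaluate such a chain:

  # Hedged sketch: evaluate a "node > x & parent > y & ..." threshold chain.
  def threshold_met(values, chain):
      # chain: (metric_name, minimum) pairs, the node first, ancestors after.
      return all(values[name] > minimum for name, minimum in chain)

  # threshold_met(metrics, [("tma_cisc", 0.1),
  #                         ("tma_microcode_sequencer", 0.05),
  #                         ("tma_heavy_operations", 0.1)])
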
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * FRONTEND_RETIRED.L1I_MISS:R / tma_info_thread_clks - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "FRONTEND_RETIRED.L2_MISS * FRONTEND_RETIRED.L2_MISS:R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instruction fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * FRONTEND_RETIRED.ITLB_MISS:R / tma_info_thread_clks - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * FRONTEND_RETIRED.STLB_MISS:R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
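
tma_code_l2_hit and tma_code_stlb_hit above are both derived the same way: cost all misses with the sampled average retire latency (the :R event modifier) and subtract the deeper-level miss cost, clamped at zero. A hedged Python sketch:

  # Hedged sketch: "hit" cost as total retire-latency cost minus deeper-miss cost.
  def code_l2_hit_fraction(l1i_miss_count, l1i_miss_avg_latency, clks,
                           l2_miss_fraction):
      total = l1i_miss_count * l1i_miss_avg_latency / clks  # all L1I-miss cycles
      return max(0.0, total - l2_miss_fraction)             # what the L2 satisfied
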
"MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * BR_MISP_RETIRED.= COND_NTAKEN_COST:R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_COST * BR_MISP_RETIRED.C= OND_TAKEN_COST:R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_cond_tk_mispredicts", + "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * MEM_LOAD_= L3_HIT_RETIRED.XSNP_MISS:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (79 * tma_i= nfo_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < M= EM_LOAD_L3_HIT_RETIRED.XSNP_MISS:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS *= (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequen= cy) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * MEM_LOAD_L3_HIT_RETIRED.XSNP_= FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (81 * tma_info_system_core_freque= ncy) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_HIT_RETIRED= .XSNP_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (81 * tma_info_system_c= ore_frequency) - 4.4 * tma_info_system_core_frequency) * (OCR.DEMAND_DATA_R= D.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DA= TA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOA= D_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
+    {
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD:R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD:R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (79 * tma_info_system_core_frequency) - 4.4 * tma_info_system_core_frequency) * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_data_sharing",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
+        "MetricName": "tma_decoder0_alone",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_divider",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
+        "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_dram_bound",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
+        "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_dsb",
+        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_dsb_switches",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivers higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * MEM_INST_RETIRED.STLB_HIT_LOADS:R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 < MEM_INST_RETIRED.STLB_HIT_LOADS:R else MEM_INST_RETIRED.STLB_HIT_LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
+        "MetricName": "tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
+        "ScaleUnit": "100%"
+    },
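
The dtlb_load/dtlb_store expressions guard against the :R retire-latency value being unavailable (zero): when it is sampled, the measured cost is used but capped at 7 cycles per STLB hit; otherwise 7 cycles is assumed outright. A hedged Python sketch of that guard:

  # Hedged sketch: the capped/fallback STLB-hit costing used by tma_dtlb_load.
  def stlb_hit_cycles_fraction(stlb_hit_loads, avg_retire_latency, clks):
      if avg_retire_latency > 0:
          cost = min(stlb_hit_loads * avg_retire_latency, stlb_hit_loads * 7)
      else:
          cost = stlb_hit_loads * 7  # no sample data; assume 7 cycles per hit
      return cost / clks
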
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * MEM_INST_RETIRED.STLB_HIT_STORES:R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if 0 < MEM_INST_RETIRED.STLB_HIT_STORES:R else MEM_INST_RETIRED.STLB_HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss",
+        "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
+        "MetricName": "tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
+        "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEMAND_RFO.L3_MISS\\,offcore_rsp\\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricName": "tma_false_sharing",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
+        "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricName": "tma_fb_full",
+        "MetricThreshold": "tma_fb_full > 0.3",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
+        "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
+        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
+        "MetricName": "tma_fetch_bandwidth",
+        "MetricThreshold": "tma_fetch_bandwidth > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+        "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
+        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_fetch_latency",
+        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
+        "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
+        "MetricName": "tma_few_uops_instructions",
+        "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
+        "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fp_arith",
+        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists",
+        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_fp_assists",
+        "MetricThreshold": "tma_fp_assists > 0.1",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_scalar",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
+        "MetricName": "tma_fp_vector",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
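
tma_fp_scalar, tma_fp_vector and the width breakdowns that follow are all uop counts normalized by retired slots; the FMA double-count caveat is why the children can sum past their parent. A hedged Python sketch of the shared shape:

  # Hedged sketch: FP uop fraction of retiring slots (may overcount FMAs).
  def fp_fraction(fp_uops, fp2_uops, retiring_fraction, slots):
      return (fp_uops + fp2_uops) / (retiring_fraction * slots)

  # tma_fp_scalar is roughly fp_fraction(FP_ARITH_INST_RETIRED.SCALAR,
  # FP_ARITH_INST_RETIRED2.SCALAR, tma_retiring, tma_info_thread_slots).
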
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 512-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF= ) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_512b", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", + "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fused_instructions",
+        "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_heavy_operations",
+        "MetricThreshold": "tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
+        "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_icache_misses",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * BR_MISP_RETIRE= D.INDIRECT_CALL_COST:R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * BR_MISP_RETIRE= D.INDIRECT_COST:R - BR_MISP_RETIRED.INDIRECT_CALL_COST * BR_MISP_RETIRED.IN= DIRECT_CALL_COST:R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts", + "MetricName": "tma_info_bad_spec_spec_clears_ratio" + }, + { + "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization = if tma_core_bound < tma_ports_utilization else 1) if tma_info_system_smt_2t= _utilization > 0.5 else 0)", + "MetricGroup": "Cor;SMT", + "MetricName": "tma_info_botlnk_l0_core_bound_likely", + "MetricThreshold": "tma_info_botlnk_l0_core_bound_likely > 0.5" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_ms))", + "MetricGroup": "DSBmiss;Fed;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_callret" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_nt" + }, + { + "BriefDescription": "Fraction of branches that are taken condition= als", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Bad;Branches;CodeGen;PGO", + "MetricName": "tma_info_branches_cond_tk" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_jump" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk + tma_info_branches_callret + tma_info_branches_jump)", + "MetricGroup": "Bad;Branches", + "MetricName": "tma_info_branches_other_branches" + }, + { + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", + "MetricGroup": "SMT", + "MetricName": "tma_info_core_core_clks" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", + "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", + "MetricName": "tma_info_core_coreipc" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Power", + "MetricName": 
"tma_info_core_epc" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\= =3D0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_= INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=3D0x18@ + 8 * cpu@FP_ARITH_INST_R= ETIRED.256B_PACKED_SINGLE\\,umask\\=3D0x60@ + 16 * FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) / tma_info_core_core_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_core_flopc" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_frontend_dsb_switch_cost" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * FRONTEND_RETIRED.AN= Y_DSB_MISS:R / tma_info_thread_clks", + "MetricGroup": "DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_frontend_fetch_upc" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_frontend_icache_miss_latency" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_ipunknown_branch" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code_all" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * FRONTEND_RETIRED.MS_FLO= WS:R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number 
of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * FRONTEND_RETIRED.= UNKNOWN_BRANCH:R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED2.SCALAR + (FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_IN= ST_RETIRED2.VECTOR))", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.128B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.256B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.512B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRE= D2.512B_PACKED_HALF)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx512", + "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). 
Values < 1 are poss= ible due to intentional FMA double counting" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Pr= ecision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED2.SCALAR", + "MetricGroup": "Flops;FpScalar;InsType;Server", + "MetricName": "tma_info_inst_mix_iparith_scalar_hp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (cpu@FP_ARITH_INST_RETIRED.SCALA= R_SINGLE\\,umask\\=3D0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE += 4 * cpu@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=3D0x18@ + 8 * c= pu@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE\\,umask\\=3D0x60@ + 16 * FP_ARI= TH_INST_RETIRED.512B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": 
"Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_ippause" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l1d_cache_fill_bw_2t" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l2_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l2_cache_fill_bw_2t" + }, + { + "BriefDescription": "Rate of non silent evictions from the L2 cach= e per Kilo instruction", + "MetricExpr": "1e3 * L2_LINES_OUT.NON_SILENT / tma_info_inst_mix_i= nstructions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_memory_core_l2_evictions_nonsilent_pki" + }, + { + "BriefDescription": "Rate of silent evictions from the L2 cache pe= r Kilo instruction where the evicted lines are dropped (no writeback to L3 = or memory)", + "MetricExpr": "1e3 * L2_LINES_OUT.SILENT / tma_info_inst_mix_instr= uctions", + "MetricGroup": "L2Evicts;Mem;Server", + "MetricName": "tma_info_memory_core_l2_evictions_silent_pki" + }, + { + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_access_bw", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_core_l3_cache_access_bw_2t" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_fill_bw", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_fb_hpki" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1d_cache_fill_bw" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + 
"MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki_load" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l2_cache_fill_bw" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_all" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_load" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Offcore", + "MetricName": "tma_info_memory_l2mpki_all" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki_load" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l3_cache_fill_bw" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_l3mpki" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": 
"LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_bus_lock_pki" + }, + { + "BriefDescription": "Off-core accesses per kilo instruction for mo= dified write requests", + "MetricExpr": "1e3 * OCR.MODIFIED_WRITE.ANY_RESPONSE / tma_info_in= st_mix_instructions", + "MetricGroup": "Offcore", + "MetricName": "tma_info_memory_mix_offcore_mwrite_any_pki" + }, + { + "BriefDescription": "Off-core accesses per kilo instruction for re= ads-to-core requests (speculative; including in-core HW prefetches)", + "MetricExpr": "1e3 * OCR.READS_TO_CORE.ANY_RESPONSE / tma_info_ins= t_mix_instructions", + "MetricGroup": "CacheHits;Offcore", + "MetricName": "tma_info_memory_mix_offcore_read_any_pki" + }, + { + "BriefDescription": "L3 cache misses per kilo instruction for read= s-to-core requests (speculative; including in-core HW prefetches)", + "MetricExpr": "1e3 * OCR.READS_TO_CORE.L3_MISS / tma_info_inst_mix= _instructions", + "MetricGroup": "Offcore", + "MetricName": "tma_info_memory_mix_offcore_read_l3m_pki" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_uc_load_pki" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. 
Per-Logical Processor)"
+    },
+    {
+        "BriefDescription": "Rate of L2 HW prefetched lines that were not used by demand accesses",
+        "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + L2_LINES_OUT.NON_SILENT)",
+        "MetricGroup": "Prefetches",
+        "MetricName": "tma_info_memory_prefetches_useless_hwpf",
+        "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15"
+    },
+    {
+        "BriefDescription": "Average DRAM BW for Reads-to-Core (R2C) covering memory attached to local- and remote-socket",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system_time",
+        "MetricGroup": "HPC;Mem;MemoryBW;SoC",
+        "MetricName": "tma_info_memory_soc_r2c_dram_bw",
+        "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) covering memory attached to local- and remote-socket. See R2C_Offcore_BW"
+    },
+    {
+        "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R2C)",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_system_time",
+        "MetricGroup": "HPC;Mem;MemoryBW;SoC",
+        "MetricName": "tma_info_memory_soc_r2c_l3m_bw",
+        "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (R2C). This covers traffic going to DRAM or other off-chip memory tiers. See R2C_Offcore_BW"
+    },
+    {
+        "BriefDescription": "Average Off-core access BW for Reads-to-Core (R2C)",
+        "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_info_system_time",
+        "MetricGroup": "HPC;Mem;MemoryBW;SoC",
+        "MetricName": "tma_info_memory_soc_r2c_offcore_bw",
+        "PublicDescription": "Average Off-core access BW for Reads-to-Core (R2C). R2C accounts for demand or prefetch load/RFO/code accesses that fill data into the Core caches"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Fed;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_code_stlb_mpki"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to STLB misses by demand loads",
+        "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * MEM_INST_RETIRED.STLB_MISS_LOADS:R / tma_info_thread_clks",
+        "MetricGroup": "Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret",
+        "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05"
+    },
+    {
+        "BriefDescription": "STLB (2nd level TLB) data load speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
+        "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_load_stlb_mpki"
+    },
+    {
+        "BriefDescription": "Utilization of the core's Page Walker(s) serving STLB misses triggered by instruction/Load/Store accesses",
+        "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_PENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)",
+        "MetricGroup": "Mem;MemoryTLB",
+        "MetricName": "tma_info_memory_tlb_page_walks_utilization",
+        "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU retirement was stalled likely due to STLB misses by demand stores",
+        "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * MEM_INST_RETIRED.STLB_MISS_STORES:R / tma_info_thread_clks",
+        "MetricGroup": "Mem;MemoryTLB",
+        "MetricName":
"tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", + "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", + "MetricName": "tma_info_pipeline_execute" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_dsb" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_mite" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", + "MetricGroup": "MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_system_core_frequency" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_system_cpu_utilization" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_cpus_utilized" + }, + { + "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", + "MetricExpr": "64 * (UNC_M_CAS_COUNT_SCH0.RD + 
UNC_M_CAS_COUNT_SCH= 1.RD + UNC_M_CAS_COUNT_SCH0.WR + UNC_M_CAS_COUNT_SCH1.WR) / 1e9 / tma_info_= system_time", + "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", + "MetricName": "tma_info_system_dram_bw_use", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(cpu@FP_ARITH_INST_RETIRED.SCALAR_SINGLE\\,umask\\= =3D0x03@ + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * cpu@FP_ARITH_= INST_RETIRED.128B_PACKED_SINGLE\\,umask\\=3D0x18@ + 8 * cpu@FP_ARITH_INST_R= ETIRED.256B_PACKED_SINGLE\\,umask\\=3D0x60@ + 16 * FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" + }, + { + "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_in= fo_system_time", + "MetricGroup": "IoBW;MemOffcore;Server;SoC", + "MetricName": "tma_info_system_io_read_bw", + "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" + }, + { + "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time", + "MetricGroup": "IoBW;MemOffcore;Server;SoC", + "MetricName": "tma_info_system_io_write_bw", + "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. 
Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" + }, + { + "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=3D0x0@", + "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", + "MetricName": "tma_info_system_mem_dram_read_latency", + "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" + }, + { + "BriefDescription": "Average number of parallel data read requests= to external memory", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", + "MetricGroup": "Mem;MemoryBW;SoC", + "MetricName": "tma_info_system_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, + { + "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "SMT", + "MetricName": "tma_info_system_smt_2t_utilization" + }, + { + "BriefDescription": "Socket actual clocks when any core is active = on that socket", + "MetricExpr": "cha_0@event\\=3D0x0@", + "MetricGroup": "SoC", + "MetricName": "tma_info_system_socket_clks" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization" + }, + { + "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", + "MetricGroup": "SoC", + "MetricName": "tma_info_system_uncore_frequency" + }, + { + "BriefDescription": "Cross-socket Ultra Path Interconnect (UPI) da= ta transmit bandwidth for data only [MB / sec]", + "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 64 / 9 / 1e6", + "MetricGroup": "Server;SoC", + "MetricName": "tma_info_system_upi_data_transmit_bw" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_thread_clks" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggests high rate of \"execute\" at rename stage"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
+        "MetricGroup": "Ret;Summary",
+        "MetricName": "tma_info_thread_ipc"
+    },
+    {
+        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
+        "MetricExpr": "slots",
+        "MetricGroup": "TmaL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots"
+    },
+    {
+        "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor",
+        "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on else 1)",
+        "MetricGroup": "SMT;TmaL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots_utilization"
+    },
+    {
+        "BriefDescription": "Uops Per Instruction",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
+        "MetricGroup": "Pipeline;Ret;Retire",
+        "MetricName": "tma_info_thread_uoppi",
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05"
+    },
+    {
+        "BriefDescription": "Uops per taken branch",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW",
+        "MetricName": "tma_info_thread_uptb",
+        "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
+        "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT.
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", + "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * MEM_LOAD_RETIRED.L2_= HIT:R, MEM_LOAD_RETIRED.L2_HIT * (4.4 * tma_info_system_core_frequency)) if= 0 < MEM_LOAD_RETIRED.L2_HIT:R else MEM_LOAD_RETIRED.L2_HIT * (4.4 * tma_in= fo_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRE= D.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * MEM_LOAD_RETIRED.L3_= HIT:R, MEM_LOAD_RETIRED.L3_HIT * (37 * tma_info_system_core_frequency) - 4.= 4 * tma_info_system_core_frequency) if 0 < MEM_LOAD_RETIRED.L3_HIT:R else M= EM_LOAD_RETIRED.L3_HIT * (37 * tma_info_system_core_frequency) - 4.4 * tma_= info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIR= ED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. 
Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_co= re_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.PORT_2_3_10", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", + "MetricExpr": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM:R * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + 
"MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", + "MetricName": "tma_local_mem", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * MEM_INST_RETIRED.LOCK= _LOADS:R / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", + "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", + "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", + "MetricName": "tma_mba_stalls", + "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. 
Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. 
Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. 
Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes",
+ "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_int_operations + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches))",
+ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
+ "MetricName": "tma_other_light_ops",
+ "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+ "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+ "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+ "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+ "MetricName": "tma_other_mispredicts",
+ "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+ "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+ "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
+ "MetricName": "tma_other_nukes",
+ "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults",
+ "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group",
+ "MetricName": "tma_page_faults",
+ "MetricThreshold": "tma_page_faults > 0.05",
+ "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch)",
+ "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks",
+ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+ "MetricName": "tma_port_0",
+ "MetricThreshold": "tma_port_0 > 0.6",
+ "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU)",
+ "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks",
+ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+ "MetricName": "tma_port_1",
+ "MetricThreshold": "tma_port_1 > 0.6",
+ "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_6, tma_ports_utilized_2",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU)",
+ "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks",
+ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
+ "MetricName": "tma_port_6",
+ "MetricThreshold": "tma_port_6 > 0.6",
+ "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+ "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+ "MetricName": "tma_ports_utilization",
+ "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+ "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+ "MetricExpr": "max(EXE_ACTIVITY.EXE_BOUND_0_PORTS - RESOURCE_STALLS.SCOREBOARD, 0) / tma_info_thread_clks",
+ "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+ "MetricName": "tma_ports_utilized_0",
+ "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+ "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+ "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+ "MetricName": "tma_ports_utilized_1",
+ "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+ "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+ "ScaleUnit": "100%"
+ },
+ {
+ "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+ "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+ "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+ "MetricName": "tma_ports_utilized_2",
+ "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+ "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", + "MetricExpr": "(MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * MEM_LOAD_L3= _MISS_RETIRED.REMOTE_HITM:R + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * MEM_LOA= D_L3_MISS_RETIRED.REMOTE_FWD:R) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_R= ETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", + "MetricName": "tma_remote_cache", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", + "MetricExpr": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM:R * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRE= D.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", + "MetricName": "tma_remote_mem", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. 
Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * BR_MISP_RETIRED.RET_COST= :R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", + "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * MEM_INST_RETIRE= D.SPLIT_LOADS:R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_r= eal_latency) if 0 < MEM_INST_RETIRED.SPLIT_LOADS:R else MEM_INST_RETIRED.SP= LIT_LOADS * tma_info_memory_load_miss_real_latency) / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * MEM_INST_RETIR= ED.SPLIT_STORES:R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < MEM_INST_RETIRED.S= PLIT_STORES:R else MEM_INST_RETIRED.SPLIT_STORES) / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", + "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", + "MetricName": "tma_store_bound", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", + "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", + "MetricName": "tma_store_fwd_blk", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", + "MetricName": "tma_store_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_= 8) / (4 * tma_info_core_core_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_store_op_utilization", + "MetricThreshold": "tma_store_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Store operations. Sample with:= UOPS_DISPATCHED.PORT_7_8", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the TLB was missed by store accesses, hitting in the second-l= evel TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_hit", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", + "MetricName": "tma_store_stlb_miss", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + 
DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates how often CPU was stall= ed due to Streaming store memory accesses; Streaming store optimize out a = read request required by RFO stores", + "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", + "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", + "MetricName": "tma_streaming_stores", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to new branch address clears", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", + "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", + "MetricName": "tma_unknown_branches", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric serves as an approximation of leg= acy x87 usage", + "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", + "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", + "MetricName": "tma_x87_use", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Uncore operating frequency in GHz", + "MetricExpr": "UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_CLOCKTIC= KS) * #num_packages) / 1e9 / duration_time", + "MetricName": "uncore_frequency", + "ScaleUnit": "1GHz" + }, + { + "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data r= eceive bandwidth (MB/sec)", + "MetricExpr": "UNC_UPI_RxL_FLITS.ALL_DATA * 7.111111111111111 / 1e= 6 / duration_time", + "MetricName": "upi_data_receive_bw", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Intel(R) Ultra Path Interconnect (UPI) data t= ransmit bandwidth (MB/sec)", + "MetricExpr": "UNC_UPI_TxL_FLITS.ALL_DATA * 7.111111111111111 / 1e= 6 / duration_time", + "MetricName": "upi_data_transmit_bw", + "ScaleUnit": "1MB/s" + } +] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/memory.json b/too= ls/perf/pmu-events/arch/x86/graniterapids/memory.json index 38b74c6752c2..5da5a10275ba 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/memory.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/memory.json @@ -72,7 +72,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 1024 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1" @@ -85,7 +84,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -98,7 +96,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -111,7 +108,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048", "MSRIndex": "0x3F6", "MSRValue": "0x800", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 2048 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "23", "UMask": "0x1" @@ -124,7 +120,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -137,7 +132,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. 
Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -150,7 +144,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -163,7 +156,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -176,7 +168,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -189,7 +180,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -200,7 +190,6 @@ "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", - "PEBS": "2", "PublicDescription": "Counts Retired memory accesses with at least= 1 store operation. This PEBS event is the precisely-distributed (PDist) tr= igger covering all stores uops for sampling by the PEBS Store Latency Facil= ity. The facility is described in Intel SDM Volume 3 section 19.9.8", "SampleAfterValue": "1000003", "UMask": "0x2" @@ -245,6 +234,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and t= he cacheline is homed locally.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.L3_MISS_LOCAL", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F04C04477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that missed the L3 Cache and were supplied by the local socket (DRAM or= PMM), whether or not in Sub NUMA Cluster(SNC) Mode. 
In SNC Mode counts PM= M or DRAM accesses that are controlled by the close or distant SNC Cluster.= It does not count misses to the L3 which go to Local CXL Type 2 Memory or= Local Non DRAM.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.L3_MISS_LOCAL_SOCKET", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70CC04477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", "Counter": "0,1,2,3", @@ -271,5 +280,95 @@ "PublicDescription": "For every cycle, increments by the number of= demand data read requests pending that are known to have missed the L3 cac= he. Note that this does not capture all elapsed cycles while requests are = outstanding - only cycles from when the requests were known by the requesti= ng core to have missed the L3 cache.", "SampleAfterValue": "2000003", "UMask": "0x10" + }, + { + "BriefDescription": "Number of times an RTM execution aborted.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED", + "PublicDescription": "Counts the number of times RTM abort was tri= ggered.", + "SampleAfterValue": "100003", + "UMask": "0x4" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 3 categories (e.g. interrupt)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_EVENTS", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 3 categories (e.g. interrupt).", + "SampleAfterValue": "100003", + "UMask": "0x80" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due = to various memory events (e.g. read/write capacity and conflicts)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_MEM", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to various memory events (e.g. read/write capacity and conflict= s).", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due = to incompatible memory type", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_MEMTYPE", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to incompatible memory type.", + "SampleAfterValue": "100003", + "UMask": "0x40" + }, + { + "BriefDescription": "Number of times an RTM execution aborted due = to HLE-unfriendly instructions", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.ABORTED_UNFRIENDLY", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to HLE-unfriendly instructions.", + "SampleAfterValue": "100003", + "UMask": "0x20" + }, + { + "BriefDescription": "Number of times an RTM execution successfully= committed", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.COMMIT", + "PublicDescription": "Counts the number of times RTM commit succee= ded.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, + { + "BriefDescription": "Number of times an RTM execution started.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "RTM_RETIRED.START", + "PublicDescription": "Counts the number of times we entered an RTM= region. 
Does not count nested transactions.", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Speculatively counts the number of TSX aborts= due to a data capacity limitation for transactional reads", + "Counter": "0,1,2,3", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CAPACITY_READ", + "PublicDescription": "Speculatively counts the number of Transacti= onal Synchronization Extensions (TSX) aborts due to a data capacity limitat= ion for transactional reads", + "SampleAfterValue": "100003", + "UMask": "0x80" + }, + { + "BriefDescription": "Speculatively counts the number of TSX aborts= due to a data capacity limitation for transactional writes.", + "Counter": "0,1,2,3", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CAPACITY_WRITE", + "PublicDescription": "Speculatively counts the number of Transacti= onal Synchronization Extensions (TSX) aborts due to a data capacity limitat= ion for transactional writes.", + "SampleAfterValue": "100003", + "UMask": "0x2" + }, + { + "BriefDescription": "Number of times a transactional abort was sig= naled due to a data conflict on a transactionally accessed address", + "Counter": "0,1,2,3", + "EventCode": "0x54", + "EventName": "TX_MEM.ABORT_CONFLICT", + "PublicDescription": "Counts the number of times a TSX line had a = cache conflict.", + "SampleAfterValue": "100003", + "UMask": "0x1" } ] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json= b/tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json new file mode 100644 index 000000000000..9129fb7b7ce4 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/graniterapids/metricgroups.json @@ -0,0 +1,145 @@ +{ + "Backend": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Met= rics spreadsheet", + "BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Branches": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "BvBC": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvBO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMT": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvOB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvUW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "C0Wait": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "CacheMisses": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "CodeGen": "Grouping from Top-down 
Microarchitecture Analysis Metrics = spreadsheet", + "Compute": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSBmiss": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "DataSharing": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "Fed": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "FetchBW": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "FetchLat": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Flops": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "FpScalar": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "FpVector": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Frontend": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "HPC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "IcMiss": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "IntVector": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", + "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "MemoryTLB": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_BW": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_Lat": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MicroSeq": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "OS": "Grouping from Top-down Microarchitecture Analysis Metrics sprea= dsheet", + "Offcore": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "PGO": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= 
preadsheet", + "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Server": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Snoop": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "SoC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Summary": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "TmaL1": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL2": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL3mem": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "TopdownL1": "Metrics for top-down breakdown at level 1", + "TopdownL2": "Metrics for top-down breakdown at level 2", + "TopdownL3": "Metrics for top-down breakdown at level 3", + "TopdownL4": "Metrics for top-down breakdown at level 4", + "TopdownL5": "Metrics for top-down breakdown at level 5", + "TopdownL6": "Metrics for top-down breakdown at level 6", + "tma_L1_group": "Metrics for top-down breakdown at level 1", + "tma_L2_group": "Metrics for top-down breakdown at level 2", + "tma_L3_group": "Metrics for top-down breakdown at level 3", + "tma_L4_group": "Metrics for top-down breakdown at level 4", + "tma_L5_group": "Metrics for top-down breakdown at level 5", + "tma_L6_group": "Metrics for top-down breakdown at level 6", + "tma_alu_op_utilization_group": "Metrics contributing to tma_alu_op_ut= ilization category", + "tma_assists_group": "Metrics contributing to tma_assists category", + "tma_backend_bound_group": "Metrics contributing to tma_backend_bound = category", + "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", + "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", + "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", + "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", + "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", + "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", + "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", + "tma_fetch_bandwidth_group": "Metrics contributing to tma_fetch_bandwi= dth category", + "tma_fetch_latency_group": "Metrics contributing to tma_fetch_latency = category", + "tma_fp_arith_group": "Metrics contributing to tma_fp_arith category", + "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", + "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", + "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", + "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", + "tma_issue2P": "Metrics related by the issue $issue2P", + "tma_issueBM": "Metrics related by the issue $issueBM", + "tma_issueBW": "Metrics related by the issue $issueBW", + "tma_issueComp": "Metrics related by the issue $issueComp", + "tma_issueD0": "Metrics related by the issue $issueD0", + "tma_issueFB": "Metrics related by the issue $issueFB", + "tma_issueFL": "Metrics related by the issue $issueFL", + "tma_issueL1": "Metrics related 
by the issue $issueL1", + "tma_issueLat": "Metrics related by the issue $issueLat", + "tma_issueMC": "Metrics related by the issue $issueMC", + "tma_issueMS": "Metrics related by the issue $issueMS", + "tma_issueMV": "Metrics related by the issue $issueMV", + "tma_issueRFO": "Metrics related by the issue $issueRFO", + "tma_issueSL": "Metrics related by the issue $issueSL", + "tma_issueSO": "Metrics related by the issue $issueSO", + "tma_issueSmSt": "Metrics related by the issue $issueSmSt", + "tma_issueSpSt": "Metrics related by the issue $issueSpSt", + "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", + "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category", + "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", + "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", + "tma_light_operations_group": "Metrics contributing to tma_light_operations category", + "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category", + "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category", + "tma_mem_bandwidth_group": "Metrics contributing to tma_mem_bandwidth category", + "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category", + "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category", + "tma_microcode_sequencer_group": "Metrics contributing to tma_microcode_sequencer category", + "tma_mite_group": "Metrics contributing to tma_mite category", + "tma_other_light_ops_group": "Metrics contributing to tma_other_light_ops category", + "tma_ports_utilization_group": "Metrics contributing to tma_ports_utilization category", + "tma_ports_utilized_0_group": "Metrics contributing to tma_ports_utilized_0 category", + "tma_ports_utilized_3m_group": "Metrics contributing to tma_ports_utilized_3m category", + "tma_retiring_group": "Metrics contributing to tma_retiring category", + "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category", + "tma_store_bound_group": "Metrics contributing to tma_store_bound category", + "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category" +} diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/other.json b/tools/perf/pmu-events/arch/x86/graniterapids/other.json index 8b9b3c920934..7a919999b78b 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/other.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/other.json @@ -1,4 +1,13 @@ [ + { + "BriefDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond the FP, SSE-AVX mix, and A/D assists, which are counted by dedicated sub-events. The event also counts Machine Ordering cases.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc1", + "EventName": "ASSISTS.HARDWARE", + "PublicDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond the FP, SSE-AVX mix, and A/D assists, which are counted by dedicated sub-events. This includes, but is not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist), assists generated by the ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists. The event also counts Machine Ordering cases.", + "SampleAfterValue": "100003", + "UMask": "0x4" + }, { "BriefDescription": "ASSISTS.PAGE_FAULT", "Counter": "0,1,2,3,4,5,6,7", @@ -65,6 +74,36 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts only those DRAM accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that have any type of response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F3FFC0002", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts demand reads for ownership (RFO) requests and software prefetches for exclusive ownership (PREFETCHW) that were supplied by DRAM.", "Counter": "0,1,2,3", @@ -115,6 +154,56 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by DRAM attached to this socket, unless in Sub NUMA Cluster(SNC) Mode. In SNC Mode counts only those DRAM accesses that are controlled by the close SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x104004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read and RFO requests including demands and prefetches to the core caches (L1 or L2) that were supplied by DRAM attached to this socket, whether or not in S=
In SNC Mode counts DRAM accesses that are contr= olled by the close or distant SNC Cluster.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.LOCAL_SOCKET_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x70C004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were not supplied by the local socket's L1, L2, or L3 caches and w= ere supplied by a remote socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F33004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts all (cacheable) data read, code read a= nd RFO requests including demands and prefetches to the core caches (L1 or = L2) that were supplied by DRAM or PMM attached to another socket.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.READS_TO_CORE.REMOTE_MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x733004477", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Counts streaming stores that have any type of= response.", "Counter": "0,1,2,3", @@ -125,6 +214,16 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts Demand RFOs, ItoM's, PREFECTHW's, Hard= ware RFO Prefetches to the L1/L2 and Streaming stores that likely resulted = in a store to Memory (DRAM or PMM)", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.WRITE_ESTIMATE.MEMORY", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFBFF80822", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json b/t= ools/perf/pmu-events/arch/x86/graniterapids/pipeline.json index 0ef9daf64e2e..da6478607984 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/pipeline.json @@ -32,7 +32,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -41,7 +40,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retir= ed.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -51,7 +49,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired= .", "SampleAfterValue": "400009", "UMask": "0x10" @@ -61,7 +58,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions= retired.", 
"SampleAfterValue": "400009", "UMask": "0x1" @@ -71,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -81,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -91,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -101,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -111,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -121,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "400009" }, @@ -130,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x44" }, @@ -139,7 +128,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -149,7 +137,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x51" }, @@ -158,7 +145,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "400009", "UMask": "0x10" @@ -168,7 +154,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x50" }, @@ -177,7 +162,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -187,7 +171,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x41" }, @@ -196,7 +179,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": 
"BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -206,7 +188,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "400009", "UMask": "0x2" @@ -216,7 +197,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x42" }, @@ -225,7 +205,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_COST", - "PEBS": "1", "SampleAfterValue": "100003", "UMask": "0xc0" }, @@ -234,7 +213,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -244,7 +222,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN_COST", - "PEBS": "1", "SampleAfterValue": "400009", "UMask": "0x60" }, @@ -253,7 +230,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -263,7 +239,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET_COST", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x48" }, @@ -511,7 +486,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -521,7 +495,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -530,7 +503,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.MACRO_FUSED", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10" }, @@ -539,7 +511,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREFETCHIT0/1 instructions", "SampleAfterValue": "2000003", "UMask": "0x2" @@ -548,7 +519,6 @@ "BriefDescription": "Precise instruction retired with PEBS precise-distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a precise distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR++) feature to fix bias in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -558,7 +528,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", - "PEBS": "1", "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8" @@ -788,6 +757,15 @@ "SampleAfterValue": "100003", "UMask": "0x20" }, + { + "BriefDescription": "Cycles stalled due to no store buffers available. (not including draining from sync).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa2", + "EventName": "RESOURCE_STALLS.SB", + "PublicDescription": "Counts allocation stall cycles caused by the store buffer (SB) being full.
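
Because RESOURCE_STALLS.SB is denominated in cycles, it reads naturally against a cycles count; a sketch:

  $ perf stat -e cycles,RESOURCE_STALLS.SB -- ./workload
  # store-buffer-full fraction = RESOURCE_STALLS.SB / cycles
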
This counts cycles that the pipeline back-en= d blocked uop delivery from the front-end.", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Counts cycles where the pipeline is stalled d= ue to serializing operations.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json= b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json index e0a45d4ea848..008ceba86fd9 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-cache.json @@ -1056,7 +1056,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Code read from local IA that miss the cache", + "BriefDescription": "Code read from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_CRD", @@ -1741,7 +1741,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA that miss th= e LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_LOCAL", @@ -1770,7 +1770,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_LOCAL", @@ -1780,7 +1780,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_REMOTE", @@ -1790,7 +1790,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA that miss th= e LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_REMOTE", @@ -1882,7 +1882,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO", @@ -1892,7 +1892,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO_PREF", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconne= ct.json b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.= json index 856ee985ecd4..5c50275c79b0 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-interconnect.json @@ -808,6 +808,26 @@ "PerPkg": "1", "Unit": "IRP" }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Rejects", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REJ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "IRP" + }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Requests", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REQ", + "Experimental": "1", + 
"PerPkg": "1", + "UMask": "0x1", + "Unit": "IRP" + }, { "BriefDescription": "Misc Events - Set 1 : Lost Forward : Snoop pu= lled away ownership before a write was committed", "Counter": "0,1,2,3", @@ -818,6 +838,46 @@ "UMask": "0x10", "Unit": "IRP" }, + { + "BriefDescription": "Snoop Hit E/S responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_ES", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x74", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit I responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_I", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x72", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit M responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_M", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x78", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop miss responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_MISS", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x71", + "Unit": "IRP" + }, { "BriefDescription": "Inbound write (fast path) requests to coheren= t memory, received by the IRP resulting in write ownership requests issued = by IRP to the mesh.", "Counter": "0,1,2,3", @@ -1550,6 +1610,33 @@ "UMask": "0x4", "Unit": "UPI" }, + { + "BriefDescription": "Cycles in L0p", + "Counter": "0,1,2,3", + "EventCode": "0x27", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Counter": "0,1,2,3", + "EventCode": "0x28", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Counter": "0,1,2,3", + "EventCode": "0x29", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, { "BriefDescription": "Matches on Transmit path of a UPI Port : Non-= Coherent Bypass", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json b/= tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json index cffb9d94b53d..886b99a971be 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-io.json @@ -17,7 +17,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -29,7 +29,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -41,7 +41,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -53,7 +53,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -65,7 +65,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -77,7 +77,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -89,7 +89,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -101,7 +101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -113,7 +113,7 @@ "FCMask": 
"0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -125,7 +125,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff0ff", + "UMask": "0xff", "Unit": "IIO" }, { @@ -137,7 +137,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -149,7 +149,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -161,7 +161,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -173,7 +173,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -185,7 +185,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -197,7 +197,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -209,7 +209,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -221,7 +221,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -233,7 +233,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -245,7 +245,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -257,7 +257,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -269,7 +269,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -281,7 +281,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -293,7 +293,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -305,7 +305,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -317,7 +317,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -329,7 +329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -341,7 +341,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -352,7 +352,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -363,7 +363,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -374,7 +374,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -385,7 +385,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -396,7 +396,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -407,7 +407,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -418,7 +418,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": 
"0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -429,7 +429,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -440,7 +440,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -451,7 +451,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -462,7 +462,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -473,7 +473,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -484,7 +484,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -495,7 +495,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -506,7 +506,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -517,7 +517,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -528,7 +528,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -539,7 +539,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -550,7 +550,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -561,7 +561,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -572,7 +572,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -583,7 +583,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -594,7 +594,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -605,7 +605,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -616,7 +616,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -627,7 +627,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -638,7 +638,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -649,7 +649,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -661,7 +661,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -673,7 +673,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -685,7 +685,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -697,7 +697,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -709,7 +709,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": 
"IIO" }, { @@ -721,7 +721,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -733,7 +733,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -745,7 +745,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -757,7 +757,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -769,7 +769,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -781,7 +781,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -793,7 +793,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +805,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -817,7 +817,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -829,7 +829,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -841,7 +841,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -853,7 +853,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1129,7 +1129,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1141,7 +1141,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1153,7 +1153,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1165,7 +1165,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1177,7 +1177,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1189,7 +1189,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1201,7 +1201,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1213,7 +1213,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -1225,7 +1225,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -1237,7 +1237,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1249,7 +1249,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1261,7 +1261,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1273,7 +1273,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1285,7 +1285,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { 
@@ -1297,7 +1297,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1318,7 +1318,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1329,7 +1329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1340,7 +1340,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1351,7 +1351,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1362,7 +1362,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1373,7 +1373,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1384,7 +1384,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1395,7 +1395,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1406,7 +1406,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1417,7 +1417,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1428,7 +1428,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1439,7 +1439,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1450,7 +1450,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1461,7 +1461,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1472,7 +1472,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1483,7 +1483,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1494,7 +1494,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1505,7 +1505,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1516,7 +1516,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1527,7 +1527,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1538,7 +1538,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1549,7 +1549,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1560,7 +1560,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1571,7 +1571,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1582,7 +1582,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": 
"IIO" }, { @@ -1593,7 +1593,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1604,7 +1604,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1615,7 +1615,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1626,7 +1626,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1637,7 +1637,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1648,7 +1648,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1659,7 +1659,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1670,7 +1670,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1681,7 +1681,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1692,7 +1692,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1703,7 +1703,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1715,7 +1715,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1727,7 +1727,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1739,7 +1739,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1751,7 +1751,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1763,7 +1763,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1775,7 +1775,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1787,7 +1787,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1799,7 +1799,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1811,7 +1811,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1823,7 +1823,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1835,7 +1835,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1847,7 +1847,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1859,7 +1859,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1871,7 +1871,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1883,7 +1883,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", 
"Unit": "IIO" }, { @@ -1895,7 +1895,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" } ] diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.jso= n b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json index 08e410b9b0a2..5f4783ff6ce5 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-memory.json @@ -169,7 +169,7 @@ "Unit": "IMC" }, { - "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled", + "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled. DCLK is 1/4 of DRAM data rate.", "Counter": "0,1,2,3", "EventCode": "0x01", "EventName": "UNC_M_CLOCKTICKS", @@ -188,6 +188,104 @@ "PublicDescription": "DRAM Clockticks", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x10", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x20", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x40", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x80", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e and all pages are closed", + "Counter": "0,1,2,3", + "EventCode": "0x88", + "EventName": "UNC_M_POWER_CHANNEL_PPD_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "Unit": "IMC" + }, { "BriefDescription": "DRAM Precharge commands. 
: Counts the number = of DRAM Precharge commands sent on this channel.", "Counter": "0,1,2,3", @@ -358,6 +456,28 @@ "PerPkg": "1", "Unit": "IMC" }, + { + "BriefDescription": "subevent0 - # of cycles all ranks were in SR = subevent1 - # of times all ranks went into SR subevent2 -# of times ps_sr_= active asserted (SRE) subevent3 - # of times ps_sr_active deasserted (SRX) = subevent4 - # of times PS-&>Refresh ps_sr_req asserted (SRE) subevent5 - # = of times PS-&>Refresh ps_sr_req deasserted (SRX) subevent6 - # of cycles PS= Ctrlr FSM was in FATAL", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles all ranks were in SR", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, { "BriefDescription": "Write Pending Queue Allocations", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json= b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json index 02e59f64a544..9ea852ef190e 100644 --- a/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json +++ b/tools/perf/pmu-events/arch/x86/graniterapids/uncore-power.json @@ -7,5 +7,103 @@ "PerPkg": "1", "PublicDescription": "PCU Clockticks: The PCU runs off a fixed 1 = GHz clock. This event counts the number of pclk cycles measured while the = counter was enabled. The pclk, like the Memory Controller's dclk, counts a= t a constant rate making it a good measure of actual wall time.", "Unit": "PCU" + }, + { + "BriefDescription": "Thermal Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x04", + "EventName": "UNC_P_FREQ_MAX_LIMIT_THERMAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Thermal Strongest Upper Limit Cycles : Numbe= r of cycles any frequency is reduced due to a thermal limit. Count only if= throttling is occurring.", + "Unit": "PCU" + }, + { + "BriefDescription": "Power Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x05", + "EventName": "UNC_P_FREQ_MAX_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Power Strongest Upper Limit Cycles : Counts = the number of cycles when power is the upper limit on frequency.", + "Unit": "PCU" + }, + { + "BriefDescription": "Cycles spent changing Frequency", + "Counter": "0,1,2,3", + "EventCode": "0x74", + "EventName": "UNC_P_FREQ_TRANS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Cycles spent changing Frequency : Counts the= number of cycles when the system is changing frequency. This can not be f= iltered by thread ID. One can also use it with the occupancy counter that = monitors number of threads in C0 to estimate the performance impact that fr= equency transitions had on the system.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C2E", + "Counter": "0,1,2,3", + "EventCode": "0x2b", + "EventName": "UNC_P_PKG_RESIDENCY_C2E_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C2E : Counts the= number of cycles when the package was in C2E. 
This event can be used in c= onjunction with edge detect to count C2E entrances (or exits using invert).= Residency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C6", + "Counter": "0,1,2,3", + "EventCode": "0x2d", + "EventName": "UNC_P_PKG_RESIDENCY_C6_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C6 : Counts the = number of cycles when the package was in C6. This event can be used in con= junction with edge detect to count C6 entrances (or exits using invert). R= esidency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C0", + "Counter": "0,1,2,3", + "EventCode": "0x35", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C0", + "PerPkg": "1", + "PublicDescription": "Number of cores in C0 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C3", + "Counter": "0,1,2,3", + "EventCode": "0x36", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Number of cores in C3 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C6", + "Counter": "0,1,2,3", + "EventCode": "0x37", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C6", + "PerPkg": "1", + "PublicDescription": "Number of cores in C6 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "External Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x0a", + "EventName": "UNC_P_PROCHOT_EXTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "External Prochot : Counts the number of cycl= es that we are in external PROCHOT mode. This mode is triggered when a sen= sor off the die determines that something off-die (like DRAM) is too hot an= d must throttle to avoid damaging the chip.", + "Unit": "PCU" + }, + { + "BriefDescription": "Internal Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x09", + "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Internal Prochot : Counts the number of cycl= es that we are in Internal PROCHOT mode. 
This mode is triggered when a sensor on the die determines that we are too hot and must throttle to avoid damaging the chip.", "Unit": "PCU" } ] diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv index 99b3101c4e4e..f3764805a700 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -13,7 +13,7 @@ GenuineIntel-6-CF,v1.11,emeraldrapids,core GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core GenuineIntel-6-B6,v1.05,grandridge,core -GenuineIntel-6-A[DE],v1.02,graniterapids,core +GenuineIntel-6-A[DE],v1.04,graniterapids,core GenuineIntel-6-(3C|45|46),v35,haswell,core GenuineIntel-6-3F,v28,haswellx,core GenuineIntel-6-7[DE],v1.22,icelake,core -- 2.48.0.rc2.279.g1de40edade-goog From nobody Sun Dec 14 13:49:51 2025 Date: Wed, 15 Jan 2025 22:43:44 -0800 In-Reply-To: <20250116064355.344245-1-irogers@google.com> Message-Id: <20250116064355.344245-13-irogers@google.com> Mime-Version: 1.0 References: <20250116064355.344245-1-irogers@google.com> Subject: [PATCH v2 12/23] perf vendor events: Update Haswell events/metrics From: Ian Rogers To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v35 to v36. Update TMA metrics from 4.8 to 5.01.
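As a quick review aid (not part of the patch itself), one way to see exactly which events a version bump like this adds, removes, or modifies is to diff the old and new JSON by EventName. A minimal Python sketch, assuming hypothetical local copies of the pre- and post-update files; the file names here are illustrative only:

import json

def load_events(path):
    # Index a pmu-events JSON file (a list of event objects) by EventName.
    with open(path) as f:
        return {e["EventName"]: e for e in json.load(f) if "EventName" in e}

old = load_events("hsw-events-v35.json")  # hypothetical pre-update copy
new = load_events("hsw-events-v36.json")  # hypothetical post-update copy

print("added:  ", sorted(new.keys() - old.keys()))
print("removed:", sorted(old.keys() - new.keys()))
print("changed:", sorted(k for k in old.keys() & new.keys() if old[k] != new[k]))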
Bring in the event updates v36: https://github.com/intel/perfmon/commit/616ec6fc0315dac35c1bea0abc7f59e21a2d51c0 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782 Co-developed-by: Caleb Biggers Signed-off-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/haswell/hsw-metrics.json | 260 ++++++++++-------- .../pmu-events/arch/x86/haswell/memory.json | 2 +- .../arch/x86/haswell/metricgroups.json | 5 + tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 4 files changed, 151 insertions(+), 118 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json b/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json index b693c0b0cafe..0c1040b7e38c 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswell/hsw-metrics.json @@ -74,12 +74,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", @@ -92,8 +92,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline.
For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", "ScaleUnit": "100%" }, { @@ -104,7 +104,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -114,7 +114,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. 
This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { @@ -125,7 +125,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", "ScaleUnit": "100%" }, { @@ -133,8 +133,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -143,18 +143,18 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_RETIRED.L3_MISS))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_RETIRED.L3_MISS)))) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. 
Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= ", "ScaleUnit": "100%" }, { @@ -165,7 +165,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { @@ -174,8 +174,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_RETIRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -183,7 +183,7 @@ "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", "ScaleUnit": "100%" }, @@ -193,8 +193,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_PENDING / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", "ScaleUnit": "100%" }, { @@ -203,7 +203,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -211,7 +211,7 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. 
Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, @@ -220,8 +220,8 @@ "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.W= ALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", "ScaleUnit": "100%" }, { @@ -229,27 +229,27 @@ "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES= .WALK_DURATION) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. 
Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "60 * OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_= CORE / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE.= Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_cle= ars", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.REQUEST_FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", "ScaleUnit": "100%" }, { @@ -279,33 +279,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. 
This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -316,7 +316,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -328,7 +328,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@) if= #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ /= 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@))", + "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@= ) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D= 0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@))", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -347,7 +347,13 @@ "MetricName": "tma_info_frontend_ipunknown_branch" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -392,7 +398,7 @@ 
"MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -415,7 +421,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -427,7 +433,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -445,7 +451,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -464,7 +470,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -496,14 +502,14 @@ "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -521,7 +527,7 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" @@ -531,13 +537,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -546,6 +553,19 @@ "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -558,6 +578,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -565,7 +592,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -574,7 +601,8 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -600,24 +628,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURAT= ION) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": 
"tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.ST= ALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_cl= ks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metri= cs: tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports= _utilized_1", "ScaleUnit": "100%" }, { @@ -625,8 +653,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY= .STALLS_L2_PENDING) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -635,8 +663,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_P= ENDING / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -645,8 +673,8 @@ "MetricExpr": "29 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_RET= IRED.L3_MISS))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metric= s: tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -654,18 +682,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -682,10 +710,10 @@ "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. 
Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -696,15 +724,15 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, = tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D6@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x6@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -713,19 +741,19 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALL= S_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_= ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UO= PS_EXECUTED.CORE\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_U= NHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\= \,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_= ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CY= CLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend= _bound", + "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS= _LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_A= CTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cp= u@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu= @UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_C= LK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.C= ORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info= _thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENT= S.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * t= ma_backend_bound", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { @@ -734,7 +762,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_l1_bound, tma_machine_clears, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_l1_bound, tma_ma= chine_clears, tma_ms_switches", "ScaleUnit": "100%" }, { @@ -743,7 +771,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. 
For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", "ScaleUnit": "100%" }, { @@ -751,8 +779,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_l1_bound= , tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -761,7 +789,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_por= t_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -770,7 +798,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. 
Related metrics: tma_fp_scalar, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_port_0, tma_port_5, tma_port_6, tma_p= orts_utilized_2", "ScaleUnit": "100%" }, { @@ -806,7 +834,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_port_= 0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -815,7 +843,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6.
Related metrics: tma_port= _0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -830,46 +858,46 @@ { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES= _NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.= CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1= else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, = CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ -= (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else c= pu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU= _CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_thread= _clks", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLE= S_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUT= ED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTE= D.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latenc= y > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.T= HREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\= =3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc >= 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENTS.EMPTY_CYCLE= S if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) - RESOURCE_STALL= S.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUT= E) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info= _core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_= NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) /= tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_core_core_clk= s", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -888,8 +916,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -897,16 +925,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -915,8 +943,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -924,18 +952,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/haswell/memory.json b/tools/per= f/pmu-events/arch/x86/haswell/memory.json index edb1b5b9f553..a76f1862e5ea 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/memory.json +++ b/tools/perf/pmu-events/arch/x86/haswell/memory.json @@ -412,7 +412,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "2", + "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x4" }, diff --git a/tools/perf/pmu-events/arch/x86/haswell/metricgroups.json b/too= ls/perf/pmu-events/arch/x86/haswell/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/haswell/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/haswell/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= 
gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index f3764805a700..587d8148da2f 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -14,7 +14,7 @@ GenuineIntel-6-5[CF],v13,goldmont,core GenuineIntel-6-7A,v1.01,goldmontplus,core GenuineIntel-6-B6,v1.05,grandridge,core GenuineIntel-6-A[DE],v1.04,graniterapids,core -GenuineIntel-6-(3C|45|46),v35,haswell,core +GenuineIntel-6-(3C|45|46),v36,haswell,core GenuineIntel-6-3F,v28,haswellx,core GenuineIntel-6-7[DE],v1.22,icelake,core GenuineIntel-6-6[AC],v1.26,icelakex,core --=20 2.48.0.rc2.279.g1de40edade-goog From nobody Sun Dec 14 13:49:51 2025 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EF312198842 for ; Thu, 16 Jan 2025 06:44:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737009895; cv=none; b=HzfCGqVjQCNBPB8raNGjCgFxyfi65bo8/IxxFa4fkCZ4ZPreVaffOiGDnlaeFLWecdtZPnKsAnssTZBeCOQmfVnJPQAmY1xv5VayxBxw6MY8W/knJ8VyRjoU4p3R+XHIyILD4jrr2JWOatA6RYtVZDw2hMoHhbTj+fMUW9vEyj4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737009895; c=relaxed/simple; bh=M41HZcZNU6GYIFyL1SALwsSsFvM2Erjtw1I7jc/5WKI=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=cDCNjgyU+8LUwqOELNO1RsJ1FjA+Z2uAzDXMZbuXQktOY/buTrYmoyYxg7GlTWcT3VwiZMUsl8Kc2z+RzhFhnpt9RMYvMTjyZD67wIv9o7TJgOyhEBbm1cMNZiG4K5n72W2yK8qGhXEkN1NjgSuIohgFtvAHwhOOU0VgqkcjmAc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=hS9aw/RJ; arc=none smtp.client-ip=209.85.219.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="hS9aw/RJ" Received: by mail-yb1-f201.google.com with SMTP id 3f1490d57ef6-e549de22484so1602143276.2 for ; Wed, 15 Jan 2025 22:44:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1737009877; x=1737614677; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=d7vyRe8kf3OVvv7UzbfhCmusr9l1McDgS2z+OEgd7Ac=; b=hS9aw/RJco+FuM9bgxUVgyiI51Bk31OKWie2yL28jE90T7pcKpsWXvLgot07ohMLNp nGI/AzhSNfNYXd4HPDycKb4J6iWobaLB8Wh/vXRk7YEJ8sRJfbmRWsv0IAlNnU5D9qh9 2Ueb/nqPQlk+ggeJt5QdwA+6hYoMIfxWbrFVfya6YlXT5gjsJdjMeK4YfNjkLMD5qNir 8ikTAT68NpqOK8K0IEsIWovvggU6DN3qjZR7CKbOUfcS3rikv4oeuXrn/uomVk1nZNYR ATKHziPT+EuT1x3qV6bQ5MjfSR6HHltXdEC5qXdVOz0fUdI4UQR+pljMev027r42c9cd yClw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; 
s=20230601; t=1737009877; x=1737614677; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=d7vyRe8kf3OVvv7UzbfhCmusr9l1McDgS2z+OEgd7Ac=; b=AX5c2Kn9YPT/Kn/CzxQf84RiFh/9gu7JRB/ZpUVTZM7aklXTcG3GBp0eEs1Y9JWV3O 6ANJvEn1aCbeZ+mx/eP2moxuxHeZ1rdBj5owZi93ec1sys0+oYAbbIqv9b9DrErZgmSp 5CQG+gcGMMIrkTstO4XqhJgTMtROBFRDx01n7jGDn77a07h02P1hWxppBwuKyB+P3VqC 3paZXrJzd3gqEvJXtPQuHAT1UjELewHIRrCrHADcuZwuWjncnKdI9fyN0NGzBAy4bIxN vyemKF9/+fv0s8YlM3FeXCW0XFtvz8tz3LeYkyME7PNMxCxSk7mhB7dOPrWElcWuq0TV 7aIQ== X-Forwarded-Encrypted: i=1; AJvYcCVoh3SwKfxxmZQgfdFplIz+6zRMJyFTxZOOQPJwZ8ZeFDV9IToppZLvCcPXNdgxDl9tsMZFmqNYRj6iMtc=@vger.kernel.org X-Gm-Message-State: AOJu0Yw6Iy5aYZ6x0cSmzbBFtJeur9xxYZs2AZJaZQV9dKDLIxVYAJOF rwRQsc1alrwFp3DpXHt2PKGjuVMZ1QuTCUNqQFSMTWvHqEzJg1AbDg5CxFFEd1C4/RPJ1vhDjQr ZdyLIag== X-Google-Smtp-Source: AGHT+IG81xbZaXn4FJiocujuZSWTj/xNFjPfqbEyOdH0zcqlJF9Thdlppmxf21pUEkpS791kEo8LbeOnW4mm X-Received: from irogers.svl.corp.google.com ([2620:15c:2c5:11:7bd:ce19:c701:bfa3]) (user=irogers job=sendgmr) by 2002:a81:a904:0:b0:6ef:70fb:459b with SMTP id 00721157ae682-6f530eea8a4mr723147b3.0.1737009876597; Wed, 15 Jan 2025 22:44:36 -0800 (PST) Date: Wed, 15 Jan 2025 22:43:45 -0800 In-Reply-To: <20250116064355.344245-1-irogers@google.com> Message-Id: <20250116064355.344245-14-irogers@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250116064355.344245-1-irogers@google.com> X-Mailer: git-send-email 2.48.0.rc2.279.g1de40edade-goog Subject: [PATCH v2 13/23] perf vendor events: Update HaswellX events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Update events from v28 to v29. Update TMA metrics from 4.8 to 5.01. 
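One quick local sanity check for a regenerated metrics file (illustrative only, not part of this patch or of the perf build): since the TMA update also prunes stale cross-references (e.g. tma_clears_resteers no longer appears in "Related metrics" lists), every "Related metrics:" entry should still name a metric defined in the same file. A minimal Python sketch, assuming the perfmon-style top-level JSON array used by the pmu-events files; the script name check_related.py is hypothetical:

  #!/usr/bin/env python3
  # check_related.py (hypothetical helper, not part of this series):
  # flag "Related metrics:" references that do not name a MetricName
  # defined in the same pmu-events JSON file.
  import json
  import re
  import sys

  with open(sys.argv[1]) as f:       # e.g. .../haswellx/hsx-metrics.json
      metrics = json.load(f)         # pmu-events metric files are JSON arrays
  names = {m["MetricName"] for m in metrics if "MetricName" in m}
  for m in metrics:
      tail = re.search(r"Related metrics: (.+)$",
                       m.get("PublicDescription", ""))
      if not tail:
          continue
      for ref in tail.group(1).rstrip(".").split(", "):
          if ref.startswith("tma_") and ref not in names:
              print(f'{m.get("MetricName")}: dangling reference {ref}')

Empty output means all cross-references resolve; run it against tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json after regenerating.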
Bring in the event updates v29: https://github.com/intel/perfmon/commit/71dbf03aba964f79fb096c9ded385c8a486= a99b3 The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-developed-by: Caleb Biggers Signed-off-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../arch/x86/haswellx/hsx-metrics.json | 296 ++++++++++-------- .../arch/x86/haswellx/metricgroups.json | 5 + .../arch/x86/haswellx/uncore-cache.json | 28 +- .../x86/haswellx/uncore-interconnect.json | 38 +-- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 5 files changed, 191 insertions(+), 178 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json b/too= ls/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json index 8f2ba3391e35..1a05b74be575 100644 --- a/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/hsx-metrics.json @@ -55,7 +55,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -76,24 +76,24 @@ "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. 
This implies it missed in the DTLB and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
-        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU.",
+        "BriefDescription": "Bandwidth of IO reads that are initiated by end device controllers that are requesting memory from the CPU",
         "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x19e@ * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_read",
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU.",
+        "BriefDescription": "Bandwidth of IO writes that are initiated by end device controllers that are writing memory to the CPU",
         "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=0x1c8\\,filter_tid\\=0x3e@ * 64 / 1e6 / duration_time",
         "MetricName": "io_bandwidth_write",
         "ScaleUnit": "1MB/s"
@@ -102,14 +102,14 @@
         "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY",
         "MetricName": "itlb_large_page_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the Instruction Translation Lookaside Buffer (ITLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions",
         "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
         "MetricName": "itlb_mpi",
-        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB.",
+        "PublicDescription": "Ratio of number of completed page walks (for all page sizes) caused by a code fetch to the total number of completed instructions. This implies it missed in the ITLB (Instruction TLB) and further levels of TLB",
         "ScaleUnit": "1per_instr"
     },
     {
@@ -209,13 +209,13 @@
         "ScaleUnit": "1MB/s"
     },
     {
-        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory read that miss the last level cache (LLC) addressed to local DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)",
         "MetricName": "numa_reads_addressed_to_local_dram",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches.",
+        "BriefDescription": "Memory reads that miss the last level cache (LLC) addressed to remote DRAM as a percentage of total memory read accesses, does not include LLC prefetches",
         "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=0x182@)",
         "MetricName": "numa_reads_addressed_to_remote_dram",
         "ScaleUnit": "100%"
@@ -276,12 +276,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
@@ -294,8 +294,8 @@
         "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY_WB_ASSIST",
         "ScaleUnit": "100%"
     },
     {
@@ -306,7 +306,7 @@
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound",
         "ScaleUnit": "100%"
     },
     {
@@ -316,7 +316,7 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
     {
@@ -327,7 +327,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
         "ScaleUnit": "100%"
     },
     {
@@ -335,8 +335,8 @@
         "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY) / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -345,18 +345,18 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD)))) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -367,7 +367,7 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
@@ -376,8 +376,8 @@
         "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -385,7 +385,7 @@
         "MetricExpr": "10 * ARITH.DIVIDER_UOPS / tma_info_core_core_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS",
         "ScaleUnit": "100%"
     },
@@ -395,8 +395,8 @@
         "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -405,7 +405,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -413,7 +413,7 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Related metrics: tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
@@ -422,8 +422,8 @@
         "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + DTLB_LOAD_MISSES.WALK_DURATION) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
@@ -431,27 +431,27 @@
         "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + DTLB_STORE_MISSES.WALK_DURATION) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE, OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.REQUEST_FB_FULL\\,cmask\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -481,33 +481,33 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound.",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
         "MetricExpr": "tma_microcode_sequencer",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses.",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
         "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETIRE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
         "BriefDescription": "Number of Instructions per non-speculative Branch Misprediction (JEClear) (lower number means higher occurrence rate)",
@@ -518,7 +518,7 @@
     },
     {
         "BriefDescription": "Core actual clocks when any Logical Processor is active on the Physical Core",
-        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))",
+        "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks)",
         "MetricGroup": "SMT",
         "MetricName": "tma_info_core_core_clks"
     },
@@ -530,7 +530,7 @@
     },
     {
         "BriefDescription": "Instruction-Level-Parallelism (average number of uops executed when there is execution) per thread (logical-processor)",
-        "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@))",
+        "MetricExpr": "(UOPS_EXECUTED.CORE / 2 / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@) if #SMT_on else UOPS_EXECUTED.CORE / (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ / 2 if #SMT_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@))",
         "MetricGroup": "Backend;Cor;Pipeline;PortsUtil",
         "MetricName": "tma_info_core_ilp"
     },
@@ -549,7 +549,13 @@
         "MetricName": "tma_info_frontend_ipunknown_branch"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
@@ -594,7 +600,7 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 9",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1",
         "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_lcp"
     },
     {
@@ -617,7 +623,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
@@ -629,7 +635,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
@@ -647,7 +653,7 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L3 cache [GB / sec]",
-        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time",
+        "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l3_cache_fill_bw"
     },
@@ -666,7 +672,7 @@
     {
         "BriefDescription": "Average Latency for L2 cache miss demand Loads",
         "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCORE_REQUESTS.DEMAND_DATA_RD",
-        "MetricGroup": "Memory_Lat;Offcore",
+        "MetricGroup": "LockCont;Memory_Lat;Offcore",
         "MetricName": "tma_info_memory_latency_load_l2_miss_latency"
     },
@@ -698,14 +704,14 @@
         "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0.5"
     },
     {
-        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired.",
-        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=1@",
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE_SLOTS\\,cmask\\=0x1@",
         "MetricGroup": "Pipeline;Ret",
         "MetricName": "tma_info_pipeline_retire"
     },
     {
         "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
-        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
         "MetricGroup": "Power;Summary",
         "MetricName": "tma_info_system_core_frequency"
     },
@@ -723,7 +729,7 @@
     },
     {
         "BriefDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]",
-        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / duration_time",
+        "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e9 / tma_info_system_time",
        "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW",
         "MetricName": "tma_info_system_dram_bw_use",
         "PublicDescription": "Average external Memory Bandwidth Use for reads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full"
@@ -733,13 +739,14 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u",
         "MetricGroup": "Branches;OS",
         "MetricName": "tma_info_system_ipfarbranch",
-        "MetricThreshold": "tma_info_system_ipfarbranch < 1e6"
+        "MetricThreshold": "tma_info_system_ipfarbranch < 1000000"
     },
     {
         "BriefDescription": "Cycles Per Instruction for the Operating System (OS) Kernel mode",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k",
         "MetricGroup": "OS",
-        "MetricName": "tma_info_system_kernel_cpi"
+        "MetricName": "tma_info_system_kernel_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Fraction of cycles spent in the Operating System (OS) Kernel mode",
@@ -750,18 +757,31 @@
     },
     {
         "BriefDescription": "Average number of parallel data read requests to external memory",
-        "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182\\,thresh\\=1@",
+        "MetricExpr": "cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=0x182@",
         "MetricGroup": "Mem;MemoryBW;SoC",
         "MetricName": "tma_info_system_mem_parallel_reads",
         "PublicDescription": "Average number of parallel data read requests to external memory. Accounts for demand loads and L1/L2 prefetches"
     },
     {
         "BriefDescription": "Average latency of data read request to external memory (in nanoseconds)",
-        "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=0x182@) / (tma_info_system_socket_clks / duration_time)",
+        "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=0x182@) / (tma_info_system_socket_clks / tma_info_system_time)",
         "MetricGroup": "Mem;MemoryLat;SoC",
         "MetricName": "tma_info_system_mem_read_latency",
         "PublicDescription": "Average latency of data read request to external memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetches. ([RKL+]memory-controller only)"
     },
+    {
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
+    },
+    {
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-ram@) / (duration_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
+    },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
         "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)",
@@ -774,6 +794,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -782,12 +809,12 @@
     },
     {
         "BriefDescription": "Measured Average Uncore Frequency for the SoC [GHz]",
-        "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system_time",
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_uncore_frequency"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks"
@@ -796,7 +823,8 @@
         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
-        "MetricName": "tma_info_thread_cpi"
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
@@ -822,24 +850,24 @@
         "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 6"
+        "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + ITLB_MISSES.WALK_DURATION) / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_MISSES.WALK_COMPLETED",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. Related metrics: tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
@@ -847,8 +875,8 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / tma_info_thread_clks",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -857,8 +885,8 @@
         "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_PENDING / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -867,8 +895,8 @@
         "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. Related metrics: tma_mem_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: tma_branch_resteers, tma_mem_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -876,18 +904,18 @@
         "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
         "MetricExpr": "tma_retiring - tma_heavy_operations",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
         "MetricThreshold": "tma_light_operations > 0.6",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
     {
@@ -905,18 +933,18 @@
         "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD))) / tma_info_thread_clks",
         "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_local_mem",
-        "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM_PS",
+        "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS_PS. Related metrics: tma_store_latency",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -927,15 +955,15 @@
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
-        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=6@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x6@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
     {
@@ -944,19 +972,19 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound",
+        "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING) + RESOURCE_STALLS.SB) / (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=0x2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) * tma_backend_bound",
         "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_memory_bound",
         "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
         "ScaleUnit": "100%"
     },
     {
@@ -965,7 +993,7 @@
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
         "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
@@ -974,7 +1002,7 @@
         "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_mite",
         "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck",
         "ScaleUnit": "100%"
     },
     {
@@ -982,8 +1010,8 @@
         "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines.
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_l1_bound= , tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -992,7 +1020,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_por= t_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1001,7 +1029,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_port_0, tma_port_5, tma_port_6, tma_p= orts_utilized_2", "ScaleUnit": "100%" }, { @@ -1037,7 +1065,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_port_= 0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1046,7 +1074,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_port= _0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1061,46 +1089,46 @@ { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES= _NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - (cpu@UOPS_EXECUTED.= CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1= else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, = CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ -= (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ if tma_info_thread_ipc > 1.8 else c= pu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) - (RS_EVENTS.EMPTY_CYCLES if tma_fetc= h_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.SB - min(CPU= _CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / tma_info_thread= _clks", + "MetricExpr": "((min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLE= S_NO_EXECUTE) + (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - (cpu@UOPS_EXECUT= ED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc > 1.8 else cpu@UOPS_EXECUTE= D.CORE\\,cmask\\=3D0x2@)) / 2 - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latenc= y > 0.1 else 0) + RESOURCE_STALLS.SB if #SMT_on else min(CPU_CLK_UNHALTED.T= HREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUTE) + cpu@UOPS_EXECUTED.CORE\\,cmask\\= =3D0x1@ - (cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ if tma_info_thread_ipc >= 1.8 else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) - (RS_EVENTS.EMPTY_CYCLE= S if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB) - RESOURCE_STALL= S.SB - min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.STALLS_LDM_PENDING)) / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). 
(2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_NO_EXECUT= E) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) / tma_info= _core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else min(CPU_CLK_UNHALTED.THREAD, CYCLE_ACTIVITY.CYCLES_= NO_EXECUTE) - (RS_EVENTS.EMPTY_CYCLES if tma_fetch_latency > 0.1 else 0)) /= tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x1@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (cpu@UOPS_EXECUTED.CORE\\= ,cmask\\=3D2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core= _clks)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else cpu@UOPS_EXECUTED.COR= E\\,cmask\\=3D0x2@ - cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_co= re_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@) / tma_info_core_core_clk= s", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1109,8 +1137,8 @@ "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMO= TE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS= _RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_M= ISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Rel= ated metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, = tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM= , MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_contested_= accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -1118,8 +1146,8 @@ "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1138,8 +1166,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1147,16 +1175,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1165,8 +1193,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1174,18 +1202,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/haswellx/metricgroups.json b/to= ols/perf/pmu-events/arch/x86/haswellx/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/haswellx/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/haswellx/uncore-cache.json b/to= ols/perf/pmu-events/arch/x86/haswellx/uncore-cache.json index 3c23bafcba28..d664af16c1db 100644 --- 
a/tools/perf/pmu-events/arch/x86/haswellx/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/haswellx/uncore-cache.json @@ -812,7 +812,7 @@ "EventCode": "0x12", "EventName": "UNC_C_RxR_EXT_STARVED.IPQ", "PerPkg": "1", - "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally startved and therefore we are blocking the IRQ.", + "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally starved and therefore we are blocking the IRQ.", "UMask": "0x2", "Unit": "CBOX" }, @@ -1869,7 +1869,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.NOT_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that could not take the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that could not take the bypass.", "UMask": "0x2", "Unit": "HA" }, @@ -1879,7 +1879,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that succeeded in taking the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that succeeded in taking the bypass.", "UMask": "0x1", "Unit": "HA" }, @@ -2874,7 +2874,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -2884,7 +2884,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -2894,7 +2894,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -2904,7 +2904,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, @@ -3438,7 +3438,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; Requests coming from both local and remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. 
This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; Requests coming from both local and remote sockets.", "UMask": "0x3", "Unit": "HA" }, @@ -3448,7 +3448,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from the local socket.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from the local socket.", "UMask": "0x1", "Unit": "HA" }, @@ -3458,7 +3458,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. 
HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from remote sockets.", "UMask": "0x2", "Unit": "HA" }, @@ -3878,7 +3878,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -3888,7 +3888,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. 
This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -3898,7 +3898,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -3908,7 +3908,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. 
This count only tracks the regular credits Common high banwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are no regular credits available for posting writes from the HA into the iMC. In order to send writes into the memory controller, the HA must first acquire a credit for the iMC's WPQ (write pending queue). This queue is broken into regular credits/buffers that are used by general writes, and special requests such as ISOCH writes. This count only tracks the regular credits Common high bandwidth workloads should be able to make use of all of the regular buffers, but it will be difficult (and uncommon) to make use of both the regular and special buffers at the same time. One can filter based on the memory controller channel. One or more channels can be tracked at a given time.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" },
diff --git a/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.json
index 121de411d312..8f73a8649b39 100644
--- a/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/haswellx/uncore-interconnect.json
@@ -1,24 +1,4 @@ [ - { - "BriefDescription": "Number of non data (control) flits transmitted . Derived from unc_q_txl_flits_g0.non_data", - "Counter": "0,1,2,3", - "EventName": "QPI_CTL_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL non-data flits transmitted across QPI. This basically tracks the protocol overhead on the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This includes the header flits for data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x4", - "Unit": "QPI" - }, - { - "BriefDescription": "Number of data flits transmitted . Derived from unc_q_txl_flits_g0.data", - "Counter": "0,1,2,3", - "EventName": "QPI_DATA_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted across the QPI Link. It includes filters for Idle, protocol, and Data Flits. Each flit is made up of 80 bits of information (in addition to some ECC data). In full-width (L0) mode, flits are made up of four fits, each of which contains 20 bits of data (along with some additional ECC data). In half-width (L0p) mode, the fits are only 10 bits, and therefore it takes twice as many fits to transmit a flit. When one talks about QPI speed (for example, 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the system will transfer 1 flit at the rate of 1/4th the QPI speed. One can calculate the bandwidth of the link by taking: flits*80b/time. Note that this is not the same as data bandwidth. For example, when we are transferring a 64B cacheline across QPI, we will break it into 9 flits -- 1 with header information and 8 with 64 bits of actual data and an additional 16 bits of other information. To calculate data bandwidth, one should therefore do: data flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data flits transmitted over QPI. Each flit contains 64b of data. This includes both DRS and NCB data flits (coherent and non-coherent). This can be used to calculate the data bandwidth of the QPI link. One can get a good picture of the QPI-link characteristics by evaluating the protocol flits, data flits, and idle/null flits. This does not include the header flits that go in data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x2", - "Unit": "QPI" - }, { "BriefDescription": "Total Write Cache Occupancy; Any Source", "Counter": "0,1",
@@ -53,7 +33,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CLFLUSH", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x80", "Unit": "IRP" },
@@ -63,7 +43,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x2", "Unit": "IRP" },
@@ -73,7 +53,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.DRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x4", "Unit": "IRP" },
@@ -83,7 +63,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIDCAHINT", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x20", "Unit": "IRP" },
@@ -93,7 +73,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIRDCUR", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x1", "Unit": "IRP" },
@@ -103,7 +83,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCITOM", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x10", "Unit": "IRP" },
@@ -113,7 +93,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.RFO", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x8", "Unit": "IRP" },
@@ -123,7 +103,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.WBMTOI", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related operations servied by the IRP", + "PublicDescription": "Counts the number of coherency related operations serviced by the IRP", "UMask": "0x40", "Unit": "IRP" },
@@ -493,7 +473,7 @@ "EventCode": "0x16", "EventName": "UNC_I_TRANSACTIONS.WRITES", "PerPkg": "1", - "PublicDescription": "Counts the number of Inbound transactions from the IRP to the Uncore. This can be filtered based on request type in addition to the source queue. Note the special filtering equation. We do OR-reduction on the request type. If the SOURCE bit is set, then we also do AND qualification based on the source portID.; Trackes only write requests. Each write request should have a prefetch, so there is no need to explicitly track these requests. For writes that are tickled and have to retry, the counter will be incremented for each retry.", + "PublicDescription": "Counts the number of Inbound transactions from the IRP to the Uncore. This can be filtered based on request type in addition to the source queue. Note the special filtering equation. We do OR-reduction on the request type. If the SOURCE bit is set, then we also do AND qualification based on the source portID.; Tracks only write requests. Each write request should have a prefetch, so there is no need to explicitly track these requests. For writes that are tickled and have to retry, the counter will be incremented for each retry.", "UMask": "0x2", "Unit": "IRP" },
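The two QPI bandwidth events removed above spell out their own arithmetic: 80 bits per flit for raw link bandwidth, and 8B per data flit in L0 (4B in half-width L0p) for data bandwidth. A minimal sketch of that arithmetic, in Python, with invented counter values standing in for measured event counts:

# Worked example of the formulas quoted in the removed
# QPI_CTL_BANDWIDTH_TX / QPI_DATA_BANDWIDTH_TX descriptions.
# All counts below are hypothetical.

def qpi_link_bandwidth_bytes_per_sec(flits, seconds):
    # Each flit carries 80 bits of information: flits * 80b / time.
    return flits * 80 / 8 / seconds

def qpi_data_bandwidth_bytes_per_sec(data_flits, seconds, l0p=False):
    # Data flits carry 64 bits (8B) each in L0, or 4B in half-width L0p.
    return data_flits * (4 if l0p else 8) / seconds

# A 64B cacheline crosses QPI as 9 flits (1 header + 8 data), so data
# bandwidth understates raw link traffic by the protocol overhead.
print(qpi_link_bandwidth_bytes_per_sec(9_000_000, 1.0))  # 90 MB/s on the link
print(qpi_data_bandwidth_bytes_per_sec(8_000_000, 1.0))  # 64 MB/s of data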
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 587d8148da2f..4d7baf4d6adf 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -15,7 +15,7 @@ GenuineIntel-6-7A,v1.01,goldmontplus,core
GenuineIntel-6-B6,v1.05,grandridge,core
GenuineIntel-6-A[DE],v1.04,graniterapids,core
GenuineIntel-6-(3C|45|46),v36,haswell,core
-GenuineIntel-6-3F,v28,haswellx,core
+GenuineIntel-6-3F,v29,haswellx,core
GenuineIntel-6-7[DE],v1.22,icelake,core
GenuineIntel-6-6[AC],v1.26,icelakex,core
GenuineIntel-6-3A,v24,ivybridge,core
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:46 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-15-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 14/23] perf vendor events: Update Icelake events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan

Update events from v1.22 to v1.24. Update TMA metrics from 4.8 to 5.01.
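For orientation, each file touched below is a flat JSON list of event or metric objects. A minimal sketch of walking one of them, assuming a kernel source checkout (field names are as they appear in the hunks below):

import json
from pathlib import Path

# Path relative to a kernel tree; this file is updated by this patch.
path = Path("tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json")

for metric in json.loads(path.read_text()):
    # Each object carries a MetricName and a MetricExpr over raw event
    # names; MetricGroup and MetricThreshold are present on most entries.
    print(metric.get("MetricName"), "-", metric.get("MetricThreshold"))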
Bring in the event updates v1.24: https://github.com/intel/perfmon/commit/d4f10746cf549466723d17cd214e1ee9cb7bac11

The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../pmu-events/arch/x86/icelake/cache.json    |  34 +-
 .../pmu-events/arch/x86/icelake/frontend.json |  17 -
 .../arch/x86/icelake/icl-metrics.json         | 849 +++++++++---------
 .../pmu-events/arch/x86/icelake/memory.json   |  13 +-
 .../arch/x86/icelake/metricgroups.json        |  10 +-
 .../pmu-events/arch/x86/icelake/pipeline.json |  30 +-
 .../arch/x86/icelake/uncore-interconnect.json |  76 --
 .../arch/x86/icelake/uncore-other.json        |   2 +-
 .../arch/x86/icelake/virtual-memory.json      |  18 +
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 10 files changed, 500 insertions(+), 551 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/icelake/cache.json b/tools/perf/pmu-events/arch/x86/icelake/cache.json
index 3508340acd0e..015f70f157d1 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/cache.json
@@ -75,11 +75,11 @@ "UMask": "0x2" }, { - "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.", + "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.", "Counter": "0,1,2,3", "EventCode": "0xF2", "EventName": "L2_LINES_OUT.SILENT", - "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.", + "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.", "SampleAfterValue": "200003", "UMask": "0x1" },
@@ -251,7 +251,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" },
@@ -262,7 +261,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" },
@@ -273,7 +271,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loads and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" },
@@ -284,7 +281,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked access.", "SampleAfterValue": "100007", "UMask": "0x21" },
@@ -295,7 +291,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" },
@@ -306,7 +301,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" },
@@ -317,7 +311,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" },
@@ -328,7 +321,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" },
@@ -339,7 +331,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" },
@@ -350,7 +341,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" },
@@ -361,7 +351,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" },
@@ -372,7 +361,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" },
@@ -383,7 +371,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" },
@@ -394,7 +381,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" },
@@ -405,7 +391,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" },
@@ -416,7 +401,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" },
@@ -427,7 +411,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" },
@@ -438,7 +421,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" },
@@ -449,7 +431,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" },
@@ -460,7 +441,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" },
@@ -910,6 +890,16 @@ "SampleAfterValue": "1000003", "UMask": "0x8" }, + { + "BriefDescription": "Cycles with outstanding code read requests pending.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x60", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD", + "PublicDescription": "Cycles with outstanding code read requests pending. Code Read requests include both cacheable and non-cacheable Code Reads. Requests are considered outstanding from the time they miss the core's L2 cache until the transaction completion message is sent to the requestor.", + "SampleAfterValue": "1000003", + "UMask": "0x2" + }, { "BriefDescription": "Cycles where at least 1 outstanding Demand RFO request is pending.", "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/icelake/frontend.json b/tools/perf/pmu-events/arch/x86/icelake/frontend.json
index e7c7d4d4152d..7afa2436d584 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/frontend.json
@@ -44,7 +44,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -56,7 +55,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -68,7 +66,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -80,7 +77,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -92,7 +88,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -104,7 +99,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x500106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -116,7 +110,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x508006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -128,7 +121,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x501006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -140,7 +132,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x500206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -152,7 +143,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x510006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -164,7 +154,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -176,7 +165,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x502006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -188,7 +176,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x500406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -200,7 +187,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x520006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -212,7 +198,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x504006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -224,7 +209,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x500806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" },
@@ -236,7 +220,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1"
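The CYCLES_WITH_DEMAND_CODE_RD event added in cache.json above pairs an occupancy counter (EventCode 0x60) with CounterMask ("cmask") 1; the new tma_code_l2_miss metric below divides it by thread clocks. A sketch of the cmask idea, with made-up per-cycle occupancies:

# The counter increments by the number of outstanding code reads each
# cycle; cmask >= 1 turns that occupancy count into "cycles with at
# least one request in flight". Occupancies here are invented.
occupancy = [0, 0, 2, 3, 3, 1, 0, 4]

raw_count = sum(occupancy)                            # occupancy event: 13
cycles_pending = sum(1 for o in occupancy if o >= 1)  # with cmask=1: 5
print(raw_count, cycles_pending, raw_count / cycles_pending)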
diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
index 9085ea60f516..63e28a03dc60 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
@@ -89,12 +89,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group", "MetricName": "tma_alu_op_utilization",
@@ -106,7 +106,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" },
@@ -129,11 +129,104 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.", + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e.g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + },
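The simplest of the new tma_bottleneck_* expressions above reduces to plain arithmetic on four counts. A sketch with invented values (perf itself evaluates the MetricExpr when the metric is requested):

# tma_bottleneck_branching_overhead from the hunk above:
# 100 * ((ALL_BRANCHES + 2 * NEAR_CALL + NOP) / slots), flagged when > 5.
# Counter values are invented for the example.
all_branches = 40_000_000   # BR_INST_RETIRED.ALL_BRANCHES
near_call = 2_000_000       # BR_INST_RETIRED.NEAR_CALL
nop = 1_000_000             # INST_RETIRED.NOP
slots = 2_000_000_000       # tma_info_thread_slots, itself a derived metric

overhead = 100 * ((all_branches + 2 * near_call + nop) / slots)
print(f"{overhead:.2f}%")   # 2.25% -> below the 5 threshold, not flagged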
+ { + "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group", "MetricName": "tma_branch_instructions",
@@ -147,7 +240,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", "ScaleUnit": "100%" },
@@ -155,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" },
@@ -164,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources", "ScaleUnit": "100%" },
@@ -173,18 +266,66 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + },
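The two tma_code_stlb_miss_{2m,4k} leaves added above simply apportion the parent walk-cycle fraction by the ratio of completed ITLB walks per page size. Sketched with invented counts:

# Page-size split used by tma_code_stlb_miss_4k / _2m above.
# All numbers are invented for illustration.
walks_4k = 900_000          # ITLB_MISSES.WALK_COMPLETED_4K
walks_2m_4m = 100_000       # ITLB_MISSES.WALK_COMPLETED_2M_4M
code_stlb_miss = 0.08       # parent metric, fraction of cycles

total = walks_4k + walks_2m_4m
miss_4k = code_stlb_miss * walks_4k / total      # 0.072
miss_2m = code_stlb_miss * walks_2m_4m / total   # 0.008
print(miss_4k, miss_2m)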
Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -194,25 +335,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "23.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_= MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(27 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOA= D_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. 
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
@@ -221,7 +362,7 @@
         "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
@@ -231,8 +372,8 @@
         "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -241,7 +382,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline.  For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline.  For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -249,44 +390,44 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses.  As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead.  Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page.  Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses.  As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead.  Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page.  Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
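
Note: both DTLB expressions above follow one pattern: a fixed 7-cycle
charge per STLB hit plus the cycles the page walker was active, with the
load variant capped so the estimate never exceeds the cycles actually
spent waiting on memory. A small Python rendering, with all counter
values invented for illustration:

  # tma_dtlb_load spelled out; every number here is hypothetical.
  stlb_hit = 50_000            # DTLB_LOAD_MISSES.STLB_HIT (cmask=0x1)
  walk_active = 900_000        # DTLB_LOAD_MISSES.WALK_ACTIVE cycles
  cycles_mem_any = 2_000_000   # CYCLE_ACTIVITY.CYCLES_MEM_ANY
  cycles_l1d_miss = 800_000    # CYCLE_ACTIVITY.CYCLES_L1D_MISS
  clks = 10_000_000            # tma_info_thread_clks

  tlb_cycles = 7 * stlb_hit + walk_active
  cap = max(cycles_mem_any - cycles_l1d_miss, 0)  # memory-wait upper bound
  tma_dtlb_load = min(tlb_cycles, cap) / clks
  print(f"tma_dtlb_load = {tma_dtlb_load:.4%}")
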
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricExpr": "32.5 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%"
     },
     {
@@ -296,7 +437,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -306,16 +447,16 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
         "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
@@ -324,7 +465,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -333,7 +474,15 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -341,7 +490,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
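
Note: the recurring "FMA double counting" caveat in the FP metrics here
refers to the FP_ARITH_INST_RETIRED events possibly counting a fused
multiply-add toward both its multiply and its add, so these uop-fraction
metrics can exceed their parent node. A toy Python illustration; all
values are invented and only meant to show the arithmetic:

  # Why tma_fp_scalar may overcount: an FMA retires as a single uop but
  # the FP_ARITH_INST_RETIRED.SCALAR event may count it twice (mul + add).
  fma_insts = 1_000_000
  other_scalar_fp = 500_000
  retiring_uops = 4_000_000   # tma_retiring * tma_info_thread_slots

  event_count = other_scalar_fp + 2 * fma_insts   # the event's view
  tma_fp_scalar = event_count / retiring_uops
  print(f"tma_fp_scalar = {tma_fp_scalar:.2%} (inflated by FMA counting)")
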
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -350,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -359,8 +508,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -368,8 +517,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -377,7 +526,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -389,17 +538,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -407,41 +556,41 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). 
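
Note: the misprediction-cost metric above now divides by 5 to turn issue
slots into cycles, matching the reworded "Cycles representing ..."
description; the 5 appears to be the slots-per-cycle width this model
assumes. A rough Python sketch with invented inputs:

  # tma_info_bad_spec_branch_misprediction_cost, per the updated formula.
  # All inputs are hypothetical.
  bottleneck_mispredictions = 25.0   # tma_bottleneck_mispredictions (%)
  thread_slots = 50_000_000_000      # tma_info_thread_slots
  mispredicts = 100_000_000          # BR_MISP_RETIRED.ALL_BRANCHES
  slots_per_cycle = 5                # the divisor in the expression

  cost_cycles = (bottleneck_mispredictions * thread_slots
                 / slots_per_cycle / mispredicts / 100)
  print(f"~{cost_cycles:.1f} cycles wasted per retired misprediction")
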
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_ret",
@@ -455,32 +604,11 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
     },
-    {
-        "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
-        "MetricExpr": "tma_info_botlnk_l0_core_bound_likely",
-        "MetricGroup": "Cor;Metric;SMT",
-        "MetricName": "tma_info_botlnk_core_bound_likely",
-        "MetricThreshold": "tma_info_botlnk_core_bound_likely > 0.5"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck.",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd))",
-        "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB",
-        "MetricName": "tma_info_botlnk_dsb_misses",
-        "MetricThreshold": "tma_info_botlnk_dsb_misses > 10"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck.",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
-        "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL",
-        "MetricName": "tma_info_botlnk_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_ic_misses > 5"
-    },
     {
         "BriefDescription": "Probability of Core Bound bottleneck hidden by SMT-profiling artifacts",
         "MetricConstraint": "NO_GROUP_EVENTS",
@@ -491,8 +619,8 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_lsd + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
@@ -500,7 +628,7 @@
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_lsd + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -509,108 +637,10 @@
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
-        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_= full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_= fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -671,11 +701,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -688,20 +718,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
     {
-        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details.",
-        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=1\\,edge@",
+        "BriefDescription": "Average number of cycles of a switch from the DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "DSBmiss",
         "MetricName": "tma_info_frontend_dsb_switch_cost"
     },
     {
         "BriefDescription": "Average number of Uops issued by front-end when it issued something",
-        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=1@",
+        "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=0x1@",
         "MetricGroup": "Fed;FetchBW",
         "MetricName": "tma_info_frontend_fetch_upc"
     },
     {
         "BriefDescription": "Average Latency for L1 instruction cache misses",
-        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@",
+        "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@",
         "MetricGroup": "Fed;FetchLat;IcMiss",
         "MetricName": "tma_info_frontend_icache_miss_latency"
     },
@@ -737,7 +767,13 @@
         "MetricName": "tma_info_frontend_lsd_coverage"
     },
     {
-        "BriefDescription": "Branch instructions per taken branch.",
+        "BriefDescription": "Taken Branches retired Per Cycle",
+        "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks",
+        "MetricGroup": "Branches;FetchBW",
+        "MetricName": "tma_info_frontend_tbpc"
+    },
+    {
+        "BriefDescription": "Branch instructions per taken branch",
         "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;PGO",
         "MetricName": "tma_info_inst_mix_bptkbranch"
     },
@@ -755,7 +791,7 @@
         "MetricGroup": "Flops;InsType",
         "MetricName": "tma_info_inst_mix_iparith",
         "MetricThreshold": "tma_info_inst_mix_iparith < 10",
-        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW."
+        "PublicDescription": "Instructions per FP Arithmetic instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting. Approximated prior to BDW"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate)",
@@ -763,7 +799,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx128",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate)",
@@ -771,7 +807,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx256",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate)",
@@ -779,7 +815,7 @@
         "MetricGroup": "Flops;FpVector;InsType",
         "MetricName": "tma_info_inst_mix_iparith_avx512",
         "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10",
-        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate)",
@@ -787,7 +823,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_dp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Double-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate)",
@@ -795,7 +831,7 @@
         "MetricGroup": "Flops;FpScalar;InsType",
         "MetricName": "tma_info_inst_mix_iparith_scalar_sp",
         "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10",
-        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting."
+        "PublicDescription": "Instructions per FP Arithmetic Scalar Single-Precision instruction (lower number means higher occurrence rate). Values < 1 are possible due to intentional FMA double counting"
     },
     {
         "BriefDescription": "Instructions per Branch (lower number means higher occurrence rate)",
@@ -840,7 +876,7 @@
     },
     {
         "BriefDescription": "Instructions per Software prefetch instruction (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrence rate)",
-        "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umask\\=0xF@",
+        "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY",
         "MetricGroup": "Prefetches",
         "MetricName": "tma_info_inst_mix_ipswpf",
         "MetricThreshold": "tma_info_inst_mix_ipswpf < 100"
@@ -850,21 +886,9 @@
         "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB",
         "MetricName": "tma_info_inst_mix_iptb",
-        "MetricThreshold": "tma_info_inst_mix_iptb < 11",
+        "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1",
        "PublicDescription": "Instructions per taken branch. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp"
     },
-    {
-        "BriefDescription": "\"Bus lock\" per kilo instruction",
-        "MetricExpr": "tma_info_memory_mix_bus_lock_pki",
-        "MetricGroup": "Mem;Metric",
-        "MetricName": "tma_info_memory_bus_lock_pki"
-    },
-    {
-        "BriefDescription": "STLB (2nd level TLB) code speculative misses per kilo instruction (misses of any page-size that complete the page walk)",
-        "MetricExpr": "tma_info_memory_tlb_code_stlb_mpki",
-        "MetricGroup": "Fed;MemoryTLB;Metric",
-        "MetricName": "tma_info_memory_code_stlb_mpki"
-    },
     {
         "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
         "MetricExpr": "tma_info_memory_l1d_cache_fill_bw",
@@ -889,12 +913,6 @@
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t"
     },
-    {
-        "BriefDescription": "Average Parallel L2 cache miss data reads",
-        "MetricExpr": "tma_info_memory_latency_data_l2_mlp",
-        "MetricGroup": "Memory_BW;Metric;Offcore",
-        "MetricName": "tma_info_memory_data_l2_mlp"
-    },
     {
         "BriefDescription": "Fill Buffer (FB) hits per kilo instructions for retired demand loads (L1D misses that merge into ongoing miss-handling entries)",
         "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY",
@@ -903,16 +921,10 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time",
+        "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l1d_cache_fill_bw"
     },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L1 data cache [GB / sec]",
-        "MetricExpr": "tma_info_memory_l1d_cache_fill_bw",
-        "MetricGroup": "Core_Metric;Mem;MemoryBW",
-        "MetricName": "tma_info_memory_l1d_cache_fill_bw_2t"
-    },
     {
         "BriefDescription": "L1 cache true misses per kilo instruction for retired demand loads",
         "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY",
@@ -927,16 +939,10 @@
     },
     {
         "BriefDescription": "Average per-thread data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time",
+        "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time",
         "MetricGroup": "Mem;MemoryBW",
         "MetricName": "tma_info_memory_l2_cache_fill_bw"
     },
-    {
-        "BriefDescription": "Average per-core data fill bandwidth to the L2 cache [GB / sec]",
-        "MetricExpr": "tma_info_memory_l2_cache_fill_bw",
-        "MetricGroup": "Core_Metric;Mem;MemoryBW",
-        "MetricName": "tma_info_memory_l2_cache_fill_bw_2t"
-    },
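
Note: the cache fill bandwidth metrics in this region all share one
shape: lines filled, times 64 bytes per cache line, scaled to GB per
second over the measured wall time (now tma_info_system_time instead of
duration_time). A minimal Python sketch, with both inputs invented:

  # tma_info_memory_l2_cache_fill_bw with hypothetical inputs.
  l2_lines_in = 3_000_000_000   # L2_LINES_IN.ALL over the run
  seconds = 10.0                # tma_info_system_time

  gb_per_sec = 64 * l2_lines_in / 1e9 / seconds
  print(f"L2 fill bandwidth ~= {gb_per_sec:.1f} GB/s")
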
"tma_info_memory_l2_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l2_cache_fill_bw_2t" - }, { "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", @@ -969,28 +975,16 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, - { - "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "tma_info_memory_l3_cache_access_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore", - "MetricName": "tma_info_memory_l3_cache_access_bw_2t" - }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", - "MetricExpr": "tma_info_memory_l3_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l3_cache_fill_bw_2t" - }, { "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", @@ -1005,52 +999,22 @@ }, { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "tma_info_memory_load_l2_miss_latency", - "MetricGroup": "Memory_Lat;Offcore", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", - "MetricName": "tma_info_memory_latency_load_l3_miss_latency" - }, - { - "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", - "MetricName": "tma_info_memory_load_l2_miss_latency" - }, - { - "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", - "MetricGroup": "Memory_BW;Metric;Offcore", - "MetricName": "tma_info_memory_load_l2_mlp" - }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": 
"cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x0@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", - "MetricName": "tma_info_memory_load_l3_miss_latency" - }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", "MetricGroup": "Mem;MemoryBound;MemoryLat", "MetricName": "tma_info_memory_load_miss_real_latency" }, - { - "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", - "MetricExpr": "tma_info_memory_tlb_load_stlb_mpki", - "MetricGroup": "Mem;MemoryTLB;Metric", - "MetricName": "tma_info_memory_load_stlb_mpki" - }, { "BriefDescription": "\"Bus lock\" per kilo instruction", "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", @@ -1059,7 +1023,7 @@ }, { "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "tma_info_memory_uc_load_pki", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_uc_load_pki" }, @@ -1071,17 +1035,11 @@ "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "tma_info_memory_tlb_page_walks_utilization", - "MetricGroup": "Core_Metric;Mem;MemoryTLB", - "MetricName": "tma_info_memory_page_walks_utilization", - "MetricThreshold": "tma_info_memory_page_walks_utilization > 0.5" - }, - { - "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", - "MetricExpr": "tma_info_memory_tlb_store_stlb_mpki", - "MetricGroup": "Mem;MemoryTLB;Metric", - "MetricName": "tma_info_memory_store_stlb_mpki" + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1108,15 +1066,9 @@ "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, - { - "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", - "MetricGroup": "Mem;Metric", - "MetricName": "tma_info_memory_uc_load_pki" - }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1143,18 +1095,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": 
"MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1172,14 +1124,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -1189,13 +1141,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1204,12 +1157,25 @@ "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1217,7 +1183,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1225,7 +1191,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). 
This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1239,6 +1205,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1246,7 +1219,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1255,14 +1228,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1272,13 +1246,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1294,33 +1268,41 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 7.5" + "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - 
"MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1329,8 +1311,17 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1339,17 +1330,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "9 * tma_info_system_core_frequency * (MEM_LOAD_RETI= RED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) = / tma_info_thread_clks", + "MetricExpr": "(12.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. 
Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1357,18 +1348,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). 
This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1385,7 +1376,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1393,16 +1384,40 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & 
tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1412,7 +1427,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%" }, { @@ -1422,16 +1437,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1439,8 +1454,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1450,11 +1465,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1468,7 +1483,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). 
These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1476,8 +1491,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1491,19 +1506,27 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. 
Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { @@ -1511,8 +1534,8 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1520,7 +1543,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1535,19 +1558,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1583,7 +1606,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1591,8 +1614,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1600,8 +1623,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1609,7 +1632,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1618,7 +1641,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1627,14 +1650,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1647,7 +1670,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1656,7 +1679,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1665,8 +1688,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1675,17 +1698,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1693,8 +1716,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1703,17 +1726,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1730,7 +1753,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1738,7 +1761,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= 
ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1746,7 +1793,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1755,7 +1802,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1764,8 +1811,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/icelake/memory.json b/tools/per= f/pmu-events/arch/x86/icelake/memory.json index f73035f44330..abaf3f4f9d63 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/memory.json +++ b/tools/perf/pmu-events/arch/x86/icelake/memory.json @@ -88,7 +88,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -101,7 +100,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -114,7 +112,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -127,7 +124,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -140,7 +136,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -153,7 +148,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. 
Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -166,7 +160,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -179,7 +172,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -287,17 +279,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was tri= ggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 3 categories (e.g. interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/icelake/metricgroups.json b/too= ls/perf/pmu-events/arch/x86/icelake/metricgroups.json index 3a88260194d1..80ca8021f2de 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/icelake/metricgroups.json @@ -37,6 +37,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -83,7 +84,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -93,6 +96,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", 
"tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", "tma_issueBW": "Metrics related by the issue $issueBW", @@ -112,10 +116,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -128,5 +135,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json b/tools/p= erf/pmu-events/arch/x86/icelake/pipeline.json index 4fdf07c7beb7..44659f26cbab 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/icelake/pipeline.json @@ -9,6 +9,15 @@ "SampleAfterValue": "1000003", "UMask": "0x9" }, + { + "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0x14", + "EventName": "ARITH.FP_DIVIDER_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, { "BriefDescription": "Number of occurrences where a microcode assis= t is invoked by hardware.", "Counter": "0,1,2,3,4,5,6,7", @@ -23,7 +32,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -32,7 +40,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retir= ed.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -42,7 +49,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired= .", "SampleAfterValue": "400009", "UMask": "0x10" @@ -52,7 +58,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", 
"PublicDescription": "Counts taken conditional branch instructions= retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -62,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -72,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -82,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -92,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -102,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -112,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "50021" }, @@ -121,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "50021", "UMask": "0x11" @@ -131,7 +129,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "50021", "UMask": "0x10" @@ -141,7 +138,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -151,7 +147,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts all miss-predicted indirect branch in= structions retired (excluding RETs. 
TSX aborts is considered indirect branc= h).", "SampleAfterValue": "50021", "UMask": "0x80" @@ -161,7 +156,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "50021", "UMask": "0x2" @@ -171,7 +165,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -181,7 +174,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "50021", "UMask": "0x8" @@ -377,7 +369,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -387,7 +378,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -396,7 +386,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -404,7 +393,6 @@ "BriefDescription": "Precise instruction retired event with a redu= ced effect of PEBS shadow in IP distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a = more unbiased distribution of samples across instructions retired. It utili= zes the Precise Distribution of Instructions Retired (PDIR) feature to miti= gate some bias in how retired instructions get sampled. Use on Fixed Counte= r 0.", "SampleAfterValue": "2000003", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.jso= n b/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.json index 909a73d7f2d3..bee503eda18c 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/icelake/uncore-interconnect.json @@ -8,71 +8,6 @@ "UMask": "0x1", "Unit": "ARB" }, - { - "BriefDescription": "This event is deprecated. 
Refer to new event = UNC_ARB_IFA_OCCUPANCY.ALL", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x85", - "EventName": "UNC_ARB_DAT_OCCUPANCY.ALL", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x1", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated.", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x85", - "EventName": "UNC_ARB_DAT_OCCUPANCY.RD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = UNC_ARB_TRK_OCCUPANCY.RD", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x80", - "EventName": "UNC_ARB_REQ_TRK_OCCUPANCY.DRD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, - { - "BriefDescription": "Number of all coherent Data Read entries. Doe= sn't include prefetches", - "Counter": "1", - "EventCode": "0x81", - "EventName": "UNC_ARB_REQ_TRK_REQUEST.DRD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated.", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x80", - "EventName": "UNC_ARB_TRK_OCCUPANCY.ALL", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x1", - "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = UNC_ARB_REQ_TRK_OCCUPANCY.DRD", - "Counter": "0", - "Deprecated": "1", - "EventCode": "0x80", - "EventName": "UNC_ARB_TRK_OCCUPANCY.RD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, { "BriefDescription": "Total number of all outgoing entries allocate= d. Accounts for Coherent and non-coherent traffic.", "Counter": "1", @@ -81,16 +16,5 @@ "PerPkg": "1", "UMask": "0x1", "Unit": "ARB" - }, - { - "BriefDescription": "This event is deprecated. Refer to new event = UNC_ARB_REQ_TRK_REQUEST.DRD", - "Counter": "0,1", - "Deprecated": "1", - "EventCode": "0x81", - "EventName": "UNC_ARB_TRK_REQUESTS.RD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" } ] diff --git a/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json b/too= ls/perf/pmu-events/arch/x86/icelake/uncore-other.json index cc8110ac020c..1ac5b5ef8094 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/icelake/uncore-other.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_CLOCK.SOCKET", + "BriefDescription": "This 48-bit fixed counter counts the UCLK cyc= les.", "Counter": "FIXED", "EventCode": "0xff", "EventName": "UNC_CLOCK.SOCKET", diff --git a/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json b/t= ools/perf/pmu-events/arch/x86/icelake/virtual-memory.json index 3ff51040f84f..9df790d4361f 100644 --- a/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/icelake/virtual-memory.json @@ -27,6 +27,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", "SampleAfterValue": "100003", "UMask": "0x8" }, { "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 4d7baf4d6adf..b6a0b67a44b8 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -16,7 +16,7 @@ GenuineIntel-6-B6,v1.05,grandridge,core GenuineIntel-6-A[DE],v1.04,graniterapids,core GenuineIntel-6-(3C|45|46),v36,haswell,core GenuineIntel-6-3F,v29,haswellx,core -GenuineIntel-6-7[DE],v1.22,icelake,core +GenuineIntel-6-7[DE],v1.24,icelake,core GenuineIntel-6-6[AC],v1.26,icelakex,core GenuineIntel-6-3A,v24,ivybridge,core GenuineIntel-6-3E,v24,ivytown,core --=20 2.48.0.rc2.279.g1de40edade-goog
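A note on the MetricThreshold rewrites in the adl-metrics.json hunks above: perf's metric expression parser gives the comparison operators higher precedence than '&', so thresholds such as "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & ...)" and their unparenthesized replacements should evaluate identically; the dropped parentheses are redundant. A minimal sketch of exercising the updated metrics, assuming a perf binary built from this tree running on a CPU matching one of the updated models:

$ perf stat -M tma_ports_utilized_0 -a -- sleep 1   # one TMA metric, measured system-wide
$ perf stat -M TopdownL1 -a -- sleep 1              # level-1 top-down breakdown (tma_retiring and friends)

The names passed to -M come from the MetricName fields in the JSON; -M accepts individual metrics as well as metric groups such as TopdownL1.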
From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:47 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-16-irogers@google.com>
Subject: [PATCH v2 15/23] perf vendor events: Update IcelakeX events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v1.26 to v1.27. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.27: https://github.com/intel/perfmon/commit/6ee80d0532a778caee68d6e29d8e0527856= 7e69f The TMA 5.01 update is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Co-developed-by: Caleb Biggers Signed-off-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- .../pmu-events/arch/x86/icelakex/cache.json | 41 +- .../arch/x86/icelakex/frontend.json | 17 - .../arch/x86/icelakex/icx-metrics.json | 852 ++++++++++-------- .../pmu-events/arch/x86/icelakex/memory.json | 13 +- .../arch/x86/icelakex/metricgroups.json | 10 +- .../arch/x86/icelakex/pipeline.json | 30 +- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- 7 files changed, 533 insertions(+), 432 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/icelakex/cache.json b/tools/per= f/pmu-events/arch/x86/icelakex/cache.json index 0cbb9d6a3ec1..e8ab6ef2cd50 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/cache.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/cache.json @@ -75,14 +75,23 @@ "UMask": "0x2" }, { - "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache when triggered by an L2 cache fill.", + "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", "Counter": "0,1,2,3", "EventCode": "0xF2", "EventName": "L2_LINES_OUT.SILENT", - "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache when triggered by an L2 cache fill. These lines are ty= pically in Shared or Exclusive state. A non-threaded event.", + "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache. These lines are typically in Shared or Exclusive stat= e. A non-threaded event.", "SampleAfterValue": "200003", "UMask": "0x1" }, + { + "BriefDescription": "Cache lines that have been L2 hardware prefet= ched but not used by demand accesses", + "Counter": "0,1,2,3", + "EventCode": "0xf2", + "EventName": "L2_LINES_OUT.USELESS_HWPF", + "PublicDescription": "Counts the number of cache lines that have b= een prefetched by the L2 hardware prefetcher but not used by demand access = when evicted from the L2 cache", + "SampleAfterValue": "200003", + "UMask": "0x4" + }, { "BriefDescription": "L2 code requests", "Counter": "0,1,2,3", @@ -224,7 +233,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts all retired load instructions. 
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -235,7 +243,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -246,7 +253,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -257,7 +263,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -268,7 +273,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -279,7 +283,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -290,7 +293,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -301,7 +303,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -312,7 +313,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -324,7 +324,6 @@ "Deprecated": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT", - "PEBS": "1", "SampleAfterValue": "20011", "UMask": "0x2" }, @@ -335,7 +334,6 @@ "Deprecated": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", - "PEBS": "1", "SampleAfterValue": "20011", "UMask": "0x4" }, @@ -345,7 +343,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -356,7 +353,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -367,7 +363,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -378,7 +373,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": 
"MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -389,7 +383,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x2" }, @@ -399,7 +392,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was forwarded from a remote cache.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -410,7 +402,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was remote HITM.", "SampleAfterValue": "100007", "UMask": "0x4" @@ -421,7 +412,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with remote= Intel(R) Optane(TM) DC persistent memory as the data source and the data r= equest missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).", "SampleAfterValue": "100007", "UMask": "0x10" @@ -432,7 +422,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -443,7 +432,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -454,7 +442,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. 
This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -465,7 +452,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -476,7 +462,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -487,7 +472,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -498,7 +482,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -509,7 +492,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -520,7 +502,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.LOCAL_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with local = Intel(R) Optane(TM) DC persistent memory as the data source and the data re= quest missed L3 (AppDirect or Memory Mode) and DRAM cache(Memory Mode).", "SampleAfterValue": "100003", "UMask": "0x80" diff --git a/tools/perf/pmu-events/arch/x86/icelakex/frontend.json b/tools/= perf/pmu-events/arch/x86/icelakex/frontend.json index d79ddc15b220..3b1bdcfaa276 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/frontend.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/frontend.json @@ -44,7 +44,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -56,7 +55,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. 
Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -68,7 +66,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -80,7 +77,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -92,7 +88,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -104,7 +99,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x500106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -116,7 +110,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x508006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -128,7 +121,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x501006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -140,7 +132,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x500206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -152,7 +143,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x510006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -164,7 +154,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -176,7 +165,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x502006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. 
During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -188,7 +176,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x500406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -200,7 +187,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x520006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -212,7 +198,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x504006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -224,7 +209,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x500806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -236,7 +220,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/too= ls/perf/pmu-events/arch/x86/icelakex/icx-metrics.json index db5510ba9099..7bee03e532e4 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json @@ -34,7 +34,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -55,49 +55,55 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. 
This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket.= ", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_PCIRDCUR + UNC_CHA_TOR_= INSERTS.IO_MISS_PCIRDCUR) * 64 / 1e6 / duration_time", + "MetricName": "io_bandwidth_read", + "ScaleUnit": "1MB/s" + }, + { + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / = duration_time", "MetricName": "io_bandwidth_read_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 /= duration_time", "MetricName": "io_bandwidth_read_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSE= RTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_I= NSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket.", + 
"BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_IN= SERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_I= NSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_remote", "ScaleUnit": "1MB/s" @@ -106,14 +112,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -164,6 +170,12 @@ "MetricName": "llc_code_read_mpi_demand_plus_prefetch", "ScaleUnit": "1per_instr" }, + { + "BriefDescription": "Ratio of number of data read requests missing= last level core cache (includes demand w/ prefetches) to the total number = of completed instructions", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_LLCPREFDATA + UNC_CHA_= TOR_INSERTS.IA_MISS_DRD + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF) / INST_RETI= RED.ANY", + "MetricName": "llc_data_read_mpi_demand_plus_prefetch", + "ScaleUnit": "1per_instr" + }, { "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) in nano seconds", "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_CHA_TOR_= OCCUPANCY.IA_MISS_DRD) * #num_packages)) * duration_time", @@ -182,6 +194,12 @@ "MetricName": "llc_demand_data_read_miss_latency_for_remote_reques= ts", "ScaleUnit": "1ns" }, + { + "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to DRAM in nano seconds= ", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_= CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR) * #num_packages)) * duration_time", + "MetricName": "llc_demand_data_read_miss_to_dram_latency", + "ScaleUnit": "1ns" + }, { "BriefDescription": "Average latency of a last level cache (LLC) d= emand data read miss (read memory access) addressed to Intel(R) Optane(TM) = Persistent Memory(PMEM) in nano seconds", "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / (UNC_CHA_CLOCKTICKS / (source_count(UNC_= CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM) * #num_packages)) * duration_time", @@ -189,25 +207,25 @@ "ScaleUnit": "1ns" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", "MetricName": "llc_miss_local_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_local_memory_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_remote_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", 
"MetricName": "llc_miss_remote_memory_bandwidth_write", "ScaleUnit": "1MB/s" @@ -237,19 +255,19 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= ).", + "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= )", "MetricExpr": "(UNC_CHA_DIR_UPDATE.HA + UNC_CHA_DIR_UPDATE.TOR + U= NC_M2M_DIRECTORY_UPDATE.ANY) * 64 / 1e6 / duration_time", "MetricName": "memory_extra_write_bw_due_to_directory_updates", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL = + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -317,12 +335,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_= clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -334,7 +352,7 @@ "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, @@ -357,11 +375,104 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. 
Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions.", + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_fu= ll)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tm= a_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_= dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latenc= y + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * = (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_boun= d + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_b= lk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4= k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_= bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * = (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_stor= e_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tm= a_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_shari= ng + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_seq= uencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_r= esteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_br= anches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma= _ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms /= (tma_mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cle= ars_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers += tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_it= lb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switc= hes) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 *= tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts *= tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_= nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_E= VENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_di= vider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_= sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_as= sists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (t= ma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bo= und) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_spl= it_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + t= ma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears, tma_remote_cache" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -375,7 +486,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredic= tions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. 
Related metrics:= tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost= , tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -383,8 +494,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -392,8 +503,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -401,18 +512,66 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= 
_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD)= )) + 43.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((48 * tma_info_system_core_frequency - 4 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OCR.DEMAND= _DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DE= MAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (47.5 * tma_info_system_core_fr= equency - 4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSN= P_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tm= a_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. 
Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -422,25 +581,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "43.5 * tma_info_system_core_frequency * (MEM_LOAD_L= 3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT /= MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(47.5 * tma_info_system_core_frequency - 4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3= _HIT_RETIRED.XSNP_HITM * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. 
Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma= _remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -449,18 +608,18 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma= _info_thread_clks - tma_l2_bound - tma_pmm_bound if #has_pmem > 0 else CYCL= E_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L= 1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bo= und)", + "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. 
Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -469,7 +628,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -477,44 +636,44 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. 
Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", - "MetricExpr": "48 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricExpr": "(120 * tma_info_system_core_frequency * cpu@OCR.DEM= AND_RFO.L3_MISS\\,offcore_rsp\\=3D0x103b800002@ + 48 * tma_info_system_core= _frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. 
Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cac= he", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -524,7 +683,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. 
Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -534,16 +693,16 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -552,7 +711,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -561,7 +720,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . 
FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -569,7 +736,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -578,7 +745,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -587,8 +754,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -596,8 +763,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -605,7 +772,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -617,17 +784,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. 
Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -635,41 +802,41 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. 
Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -683,7 +850,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and 
nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -698,8 +865,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -707,7 +874,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -716,108 +883,10 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)= ) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma= _l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_sq_full= / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq= _full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound= + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_f= b_full / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latenc= y + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) = + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l= 2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l3_hit_la= tency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) + tma= _memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_= bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_store_laten= cy / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_lat= ency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound / (tma_dra= m_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_= store_bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tm= a_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_s= tore_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_= 4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_l= atency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_st= ore_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma= _false_sharing + tma_split_stores + tma_store_latency + tma_streaming_store= s)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) *= tma_remote_cache / (tma_local_mem + tma_remote_cache + tma_remote_mem) + t= ma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_pmm_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sha= ring) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bou= nd + tma_l3_bound + tma_pmm_bound + tma_store_bound) * tma_false_sharing / = (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency = + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tm= a_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -878,11 +947,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -895,20 +964,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -938,7 +1007,13 @@ "MetricName": "tma_info_frontend_l2mpki_code_all" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -956,7 +1031,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -964,7 +1039,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -972,7 +1047,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -980,7 +1055,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -988,7 +1063,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -996,7 +1071,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1041,7 +1116,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1051,7 +1126,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 11", + "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1098,7 +1173,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1116,7 +1191,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1152,13 +1227,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1177,21 +1252,15 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", - "MetricName": 
"tma_info_memory_latency_load_l3_miss_latency" - }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", @@ -1217,6 +1286,13 @@ "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" + }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", @@ -1244,7 +1320,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1265,18 +1341,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1294,28 +1370,28 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / durati= on_time", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_in= fo_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSE= RTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_I= NSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e9 / duration_time", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_HIT_ITOM + UNC_CHA_TOR_INSE= RTS.IO_MISS_ITOM + UNC_CHA_TOR_INSERTS.IO_HIT_ITOMCACHENEAR + UNC_CHA_TOR_I= NSERTS.IO_MISS_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_write_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" @@ -1325,13 +1401,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1347,45 +1424,46 @@ "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. 
Accounts for demand loads and L1/L2 data= -read prefetches" }, + { + "BriefDescription": "Fraction of Uncore cycles where requests got = rejected due to duplicate address already in IRQ ingress queue in the cache= homing agent", + "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", + "MetricGroup": "LockCont;MemOffcore;Server;SoC", + "MetricName": "tma_info_system_mem_irq_duplicate_address", + "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, - { - "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]", - "MetricExpr": "(1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC= _CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / cha_0@event\\=3D0x0@ if #has_pmem > 0 e= lse 0)", - "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", - "MetricName": "tma_info_system_mem_pmm_read_latency", - "PublicDescription": "Average latency of data read request to exte= rnal 3D X-Point memory [in nanoseconds]. Accounts for demand loads and L1/L= 2 data-read prefetches" - }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_t= ime)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_read_bw"
+        "BriefDescription": "PerfMon Event Multiplexing accuracy indicator",
+        "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_mux",
+        "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mux < 0.9"
     },
     {
-        "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes [GB / sec]",
-        "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemOffcore;MemoryBW;Server;SoC",
-        "MetricName": "tma_info_system_pmm_write_bw"
+        "BriefDescription": "Total package Power in Watts",
+        "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-ram@) / (duration_time * 1e6)",
+        "MetricGroup": "Power;SoC",
+        "MetricName": "tma_info_system_power"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0",
         "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_clks",
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license0_utilization",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for baseline license level 0. This includes non-AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1",
@@ -1393,7 +1471,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license1_utilization",
         "MetricThreshold": "tma_info_system_power_license1_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 1. This includes high current AVX 256-bit instructions as well as low current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX)",
@@ -1401,7 +1479,7 @@
         "MetricGroup": "Power",
         "MetricName": "tma_info_system_power_license2_utilization",
         "MetricThreshold": "tma_info_system_power_license2_utilization > 0.5",
-        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions."
+        "PublicDescription": "Fraction of Core cycles where the core was running with power-delivery for license level 2 (introduced in SKX). This includes high current AVX 512-bit instructions"
     },
     {
         "BriefDescription": "Fraction of cycles where both hardware Logical Processors were active",
@@ -1415,6 +1493,13 @@
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_socket_clks"
     },
+    {
+        "BriefDescription": "Run duration time in seconds",
+        "MetricExpr": "duration_time",
+        "MetricGroup": "Summary",
+        "MetricName": "tma_info_system_time",
+        "MetricThreshold": "tma_info_system_time < 1"
+    },
     {
         "BriefDescription": "Average Frequency Utilization relative nominal frequency",
         "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC",
@@ -1423,12 +1508,12 @@
     },
     {
         "BriefDescription": "Measured Average Uncore Frequency for the SoC [GHz]",
-        "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time",
+        "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system_time",
         "MetricGroup": "SoC",
         "MetricName": "tma_info_system_uncore_frequency"
     },
     {
-        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active.",
+        "BriefDescription": "Per-Logical Processor actual clocks when the Logical Processor is active",
         "MetricExpr": "CPU_CLK_UNHALTED.THREAD",
         "MetricGroup": "Pipeline",
         "MetricName": "tma_info_thread_clks"
@@ -1437,14 +1522,15 @@
         "BriefDescription": "Cycles Per Instruction (per Logical Processor)",
         "MetricExpr": "1 / tma_info_thread_ipc",
         "MetricGroup": "Mem;Pipeline",
-        "MetricName": "tma_info_thread_cpi"
+        "MetricName": "tma_info_thread_cpi",
+        "ScaleUnit": "1per_instr"
     },
     {
         "BriefDescription": "The ratio of Executed- by Issued-Uops",
         "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY",
         "MetricGroup": "Cor;Pipeline",
         "MetricName": "tma_info_thread_execute_per_issue",
-        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage."
+        "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate of \"execute\" at rename stage"
     },
     {
         "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
@@ -1454,13 +1540,13 @@
     },
     {
         "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
-        "MetricExpr": "TOPDOWN.SLOTS",
+        "MetricExpr": "slots",
         "MetricGroup": "TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots"
     },
     {
         "BriefDescription": "Fraction of Physical Core issue-slots utilized by this Logical Processor",
-        "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SMT_on else 1)",
+        "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on else 1)",
         "MetricGroup": "SMT;TmaL1;tma_L1_group",
         "MetricName": "tma_info_thread_slots_utilization"
     },
@@ -1476,33 +1562,41 @@
         "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
         "MetricGroup": "Branches;Fed;FetchBW",
         "MetricName": "tma_info_thread_uptb",
-        "MetricThreshold": "tma_info_thread_uptb < 7.5"
+        "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
         "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache",
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
         "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
         "MetricName": "tma_l1_bound",
-        "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 data cache. The L1 data cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB_HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads who miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache",
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
         "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
-        "MetricName": "tma_l1_hit_latency",
-        "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles with demand load accesses that hit the L1 cache. The short latency of the L1 data cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -1511,8 +1605,17 @@
         "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks)",
         "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l2_bound",
-        "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT_PS",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "4 * tma_info_system_core_frequency * MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT",
         "ScaleUnit": "100%"
     },
     {
@@ -1521,17 +1624,17 @@
         "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STALLS_L3_MISS) / tma_info_thread_clks",
         "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_l3_bound",
-        "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS",
+        "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to loads accesses to L3 cache or contended with a sibling Core. Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited)",
-        "MetricExpr": "19 * tma_info_system_core_frequency * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
+        "MetricExpr": "(23 * tma_info_system_core_frequency - 4 * tma_info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks",
         "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat;tma_l3_bound_group",
         "MetricName": "tma_l3_hit_latency",
-        "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tma_info_bottleneck_cache_memory_latency, tma_mem_latency",
+        "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles with demand load accesses that hit the L3 cache under unloaded scenarios (possibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3 hits) will improve the latency; reduce contention with sibling physical cores and increase performance. Note the value of this node may overlap with its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_bottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1539,18 +1642,18 @@
         "MetricExpr": "DECODE.LCP / tma_info_thread_clks",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_lcp",
-        "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. #Link: Optimization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles CPU was stalled due to Length Changing Prefixes (LCPs). Using proper compiler flags or Intel Compiler by default will certainly avoid this. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation)",
         "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
         "MetricThreshold": "tma_light_operations > 0.6",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations , instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
     {
@@ -1567,7 +1670,7 @@
         "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_hit",
-        "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1575,15 +1678,39 @@
         "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
         "MetricName": "tma_load_stlb_miss",
-        "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory",
-        "MetricExpr": "43.5 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(66.5 * tma_info_system_core_frequency - 23 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_local_mem",
-        "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from local memory. Caching will improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM",
         "ScaleUnit": "100%"
     },
@@ -1591,9 +1718,9 @@
         "BriefDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS.ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10 * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks",
-        "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
         "MetricName": "tma_lock_latency",
-        "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -1604,16 +1731,16 @@
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
-        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
         "MetricName": "tma_mem_bandwidth",
-        "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full",
         "ScaleUnit": "100%"
     },
     {
@@ -1621,8 +1748,8 @@
         "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
         "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
         "MetricName": "tma_mem_latency",
-        "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1632,11 +1759,11 @@
         "MetricName": "tma_memory_bound",
         "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations -- uops for memory load or store accesses.",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations , uops for memory load or store accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_RETIRED.ANY",
         "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
@@ -1650,7 +1777,7 @@
         "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
         "MetricName": "tma_microcode_sequencer",
         "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
         "ScaleUnit": "100%"
     },
     {
@@ -1658,8 +1785,8 @@
         "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
         "MetricName": "tma_mispredicts_resteers",
-        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
         "ScaleUnit": "100%"
     },
     {
@@ -1673,19 +1800,27 @@
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where (only) 4 uops were delivered by the MITE pipeline",
-        "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=5@) / tma_info_thread_clks",
+        "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=0x4@ - cpu@IDQ.MITE_UOPS\\,cmask\\=0x5@) / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_group",
         "MetricName": "tma_mite_4wide",
-        "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles)",
+        "BriefDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles)",
         "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
         "MetricName": "tma_mixing_vectors",
         "MetricThreshold": "tma_mixing_vectors > 0.05",
-        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued -- the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of([SKL+] injected blend uops out of all Uops Issued , the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=0x1@ / tma_info_core_core_clks / 2",
+        "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1693,8 +1828,8 @@
         "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks",
         "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
         "MetricName": "tma_ms_switches",
-        "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). Switching to the MS too often can negatively impact performance. The MS is designated to deliver long uop flows required by CISC instructions like CPUID; or uncommon conditions like Floating Point Assists when dealing with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation",
         "ScaleUnit": "100%"
     },
     {
@@ -1702,7 +1837,7 @@
         "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
         "MetricName": "tma_nop_instructions",
-        "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_ops > 0.3 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
         "ScaleUnit": "100%"
     },
@@ -1717,28 +1852,19 @@
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types).",
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
         "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
         "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
         "MetricName": "tma_other_mispredicts",
-        "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15)",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering.",
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
         "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
         "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
         "MetricName": "tma_other_nukes",
-        "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears > 0.1 & tma_bad_speculation > 0.15)",
-        "ScaleUnit": "100%"
-    },
-    {
-        "BriefDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a",
-        "MetricExpr": "(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound) if 1e6 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_MISS else 0) if #has_pmem > 0 else 0)",
-        "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
-        "MetricName": "tma_pmm_bound",
-        "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric roughly estimates (based on idle latencies) how often the CPU was stalled on accesses to external 3D-Xpoint (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Memory Module.",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
         "ScaleUnit": "100%"
     },
     {
@@ -1774,7 +1900,7 @@
         "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_group;tma_issue2P",
         "MetricName": "tma_port_6",
         "MetricThreshold": "tma_port_6 > 0.6",
-        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -1782,8 +1908,8 @@
         "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CYCLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)",
         "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_ports_utilization",
-        "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations.",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
         "ScaleUnit": "100%"
     },
     {
@@ -1791,8 +1917,8 @@
         "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=0x80@ / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_0",
-        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric.",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
         "ScaleUnit": "100%"
     },
     {
@@ -1800,7 +1926,7 @@
         "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_1",
-        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
         "ScaleUnit": "100%"
     },
@@ -1809,7 +1935,7 @@
         "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
         "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_2",
-        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6",
         "ScaleUnit": "100%"
     },
@@ -1818,32 +1944,32 @@
         "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
         "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
         "MetricName": "tma_ports_utilized_3m",
-        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utilization > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues",
-        "MetricExpr": "(97 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + 97 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "((120 * tma_info_system_core_frequency - 23 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (120 * tma_info_system_core_frequency - 23 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_issueSyncxn;tma_mem_latency_group",
         "MetricName": "tma_remote_cache",
-        "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote cache in other sockets including synchronizations issues. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory",
-        "MetricExpr": "108 * tma_info_system_core_frequency * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(131 * tma_info_system_core_frequency - 23 * tma_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latency_group",
         "MetricName": "tma_remote_mem",
-        "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS",
+        "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling loads from remote memory. This is caused often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
         "DefaultMetricgroupName": "TopdownL1",
-        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
+        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
@@ -1856,7 +1982,7 @@
         "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks",
         "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO",
         "MetricName": "tma_serializing_operation",
-        "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches",
         "ScaleUnit": "100%"
     },
@@ -1865,7 +1991,7 @@
         "MetricExpr": "37 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_slow_pause",
-        "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUSE_INST",
         "ScaleUnit": "100%"
     },
@@ -1874,8 +2000,8 @@
         "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.NO_SR / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_split_loads",
-        "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
         "ScaleUnit": "100%"
     },
     {
@@ -1884,17 +2010,17 @@
         "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
         "MetricName": "tma_split_stores",
-        "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_4",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
         "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
         "MetricName": "tma_sq_full",
-        "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_fb_full, tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth",
         "ScaleUnit": "100%"
     },
     {
@@ -1902,8 +2028,8 @@
         "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_store_bound",
-        "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES_PS",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
         "ScaleUnit": "100%"
     },
     {
@@ -1912,17 +2038,17 @@
         "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_store_fwd_blk",
-        "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading.",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
         "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
-        "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
         "MetricName": "tma_store_latency",
-        "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -1939,7 +2065,7 @@
         "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_hit",
-        "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1947,7 +2073,31 @@
         "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
         "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
         "MetricName": "tma_store_stlb_miss",
-        "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)))",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -1955,7 +2105,7 @@
         "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
         "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
         "MetricName": "tma_streaming_stores",
-
"MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1964,7 +2114,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1973,8 +2123,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/icelakex/memory.json b/tools/pe= rf/pmu-events/arch/x86/icelakex/memory.json index 32a3dedb82fb..ec9577cce3ac 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/memory.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/memory.json @@ -25,7 +25,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -38,7 +37,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. 
Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -51,7 +49,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -64,7 +61,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -77,7 +73,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -90,7 +85,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -103,7 +97,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -116,7 +109,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -353,17 +345,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was tri= ggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due = to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution = aborted due to none of the previous 3 categories (e.g. 
interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/icelakex/metricgroups.json b/to= ols/perf/pmu-events/arch/x86/icelakex/metricgroups.json index cccfcab3425e..4a28187c2896 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/metricgroups.json @@ -38,6 +38,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -84,7 +85,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -94,6 +97,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", "tma_issueBW": "Metrics related by the issue $issueBW", @@ -113,10 +117,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -129,5 +136,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - 
"tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json b/tools/= perf/pmu-events/arch/x86/icelakex/pipeline.json index 74285b6c81e7..f1446f1b67c6 100644 --- a/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/icelakex/pipeline.json @@ -9,6 +9,15 @@ "SampleAfterValue": "1000003", "UMask": "0x9" }, + { + "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0x14", + "EventName": "ARITH.FP_DIVIDER_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1" + }, { "BriefDescription": "Number of occurrences where a microcode assis= t is invoked by hardware.", "Counter": "0,1,2,3,4,5,6,7", @@ -23,7 +32,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -32,7 +40,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retir= ed.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -42,7 +49,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired= .", "SampleAfterValue": "400009", "UMask": "0x10" @@ -52,7 +58,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions= retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -62,7 +67,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -72,7 +76,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -82,7 +85,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call in= structions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -92,7 +94,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -102,7 +103,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -112,7 +112,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. 
A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", "SampleAfterValue": "50021" }, @@ -121,7 +120,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", "SampleAfterValue": "50021", "UMask": "0x11" @@ -131,7 +129,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", "SampleAfterValue": "50021", "UMask": "0x10" @@ -141,7 +138,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -151,7 +147,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts all miss-predicted indirect branch in= structions retired (excluding RETs. TSX aborts is considered indirect branc= h).", "SampleAfterValue": "50021", "UMask": "0x80" @@ -161,7 +156,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near t= aken) calls, including both register and memory indirect.", "SampleAfterValue": "50021", "UMask": "0x2" @@ -171,7 +165,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -181,7 +174,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", "SampleAfterValue": "50021", "UMask": "0x8" @@ -377,7 +369,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -387,7 +378,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of instructions retired - = an Architectural PerfMon event. Counting continues during hardware interrup= ts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counte= d by a designated fixed counter freeing up programmable counters to count o= ther events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -396,7 +386,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -404,7 +393,6 @@ "BriefDescription": "Precise instruction retired event with a reduced effect of PEBS shadow in IP distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a more unbiased distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1"
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index b6a0b67a44b8..83b8ca32fd1d 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -17,7 +17,7 @@ GenuineIntel-6-A[DE],v1.04,graniterapids,core
 GenuineIntel-6-(3C|45|46),v36,haswell,core
 GenuineIntel-6-3F,v29,haswellx,core
 GenuineIntel-6-7[DE],v1.24,icelake,core
-GenuineIntel-6-6[AC],v1.26,icelakex,core
+GenuineIntel-6-6[AC],v1.27,icelakex,core
 GenuineIntel-6-3A,v24,ivybridge,core
 GenuineIntel-6-3E,v24,ivytown,core
 GenuineIntel-6-2D,v24,jaketown,core
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:48 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-17-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 16/23] perf vendor events: Update/add Lunarlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update events from v1.01 to v1.10. Add TMA metrics 5.01.
Bring in the event updates v1.10:
https://github.com/intel/perfmon/commit/4a1cff8cebe9791a1ceb91ca39fc64e9139a3993
https://github.com/intel/perfmon/commit/cbc3b0dc19e8fc52c9604f1da301648ed69f012b
https://github.com/intel/perfmon/commit/28f4b24f9152a0ee1fb3435535628384ad881c22
https://github.com/intel/perfmon/commit/172900e962fdd34ddb80879f4f91add5f773ca29
https://github.com/intel/perfmon/commit/dab0308f7a27d2c644e08d63436b790a207fb22e

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../pmu-events/arch/x86/lunarlake/cache.json  | 1322 ++++-
 .../arch/x86/lunarlake/floating-point.json    |  484 ++
 .../arch/x86/lunarlake/frontend.json          |  654 ++-
 .../arch/x86/lunarlake/lnl-metrics.json       | 4611 +++++++++++++++++
 .../pmu-events/arch/x86/lunarlake/memory.json |  262 +-
 .../arch/x86/lunarlake/metricgroups.json      |  150 +
 .../pmu-events/arch/x86/lunarlake/other.json  |  496 +-
 .../arch/x86/lunarlake/pipeline.json          | 2104 +++++++-
 .../arch/x86/lunarlake/uncore-memory.json     |   18 +
 .../arch/x86/lunarlake/virtual-memory.json    |  428 ++
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 11 files changed, 10218 insertions(+), 313 deletions(-)
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/metricgroups.json
 create mode 100644 tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json

diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/cache.json b/tools/perf/pmu-events/arch/x86/lunarlake/cache.json
index 759714618e08..9bad6a354b2b 100644
--- a/tools/perf/pmu-events/arch/x86/lunarlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/lunarlake/cache.json
@@ -1,4 +1,255 @@ [ + { + "BriefDescription": "Counts the number of request that were not accepted into the L2Q because the L2Q is FULL.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x31", + "EventName": "CORE_REJECT_L2Q.ANY", + "PublicDescription": "Counts the number of (demand and L1 prefetchers) core requests rejected by the L2Q due to a full or nearly full w condition which likely indicates back pressure from L2Q. It also counts requests that would have gone directly to the XQ, but are rejected due to a full or nearly full condition, indicating back pressure from the IDI link. The L2Q may also reject transactions from a core to insure fairness between cores, or to delay a cores dirty eviction when the address conflicts incoming external snoops. (Note that L2 prefetcher requests that are dropped are not counted by this event.)", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L1D cacheline (dirty) evictions caused by load misses, stores, and prefetches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x51", + "EventName": "DL1.DIRTY_EVICTION", + "PublicDescription": "Counts the number of L1D cacheline (dirty) evictions caused by load misses, stores, and prefetches. Does not count evictions or dirty writebacks caused by snoops. 
Does not count a replacement = unless a (dirty) line was written back.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines replaced in = L0 data cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x51", + "EventName": "L1D.L0_REPLACEMENT", + "PublicDescription": "Counts L0 data line replacements including o= pportunistic replacements, and replacements that require stall-for-replace = or block-for-replace.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cachelines replaced into the L0 and L1 d-cach= e. Successful replacements only (not blocked) and exclude WB-miss case", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x51", + "EventName": "L1D.REPLACEMENT", + "PublicDescription": "Counts cachelines replaced into the L0 and L= 1 d-cache.", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of cycles a demand request has waited = due to L1D Fill Buffer (FB) unavailability.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.FB_FULL", + "PublicDescription": "Counts number of cycles a demand request has= waited due to L1D Fill Buffer (FB) unavailability. Demand requests include= cacheable/uncacheable demand load, store, lock or SW prefetch accesses.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of cycles a demand request has waited = due to L1D due to lack of L2 resources.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.L2_STALLS", + "PublicDescription": "Counts number of cycles a demand request has= waited due to L1D due to lack of L2 resources. Demand requests include cac= heable/uncacheable demand load, store, lock or SW prefetch accesses.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of demand requests that missed L1D cac= he", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x49", + "EventName": "L1D_MISS.LOAD", + "PublicDescription": "Count occurrences (rising-edge) of DCACHE_PE= NDING sub-event0. Impl. sends per-port binary inc-bit the occupancy increas= es* (at FB alloc or promotion).", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of L1D misses that are outstanding", + "Counter": "2", + "EventCode": "0x48", + "EventName": "L1D_PENDING.LOAD", + "PublicDescription": "Counts number of L1D misses that are outstan= ding in each cycle, that is each cycle the number of Fill Buffers (FB) outs= tanding required by Demand Reads. FB either is held by demand loads, or it = is held by non-demand loads and gets hit at least once by demand. The valid= outstanding interval is defined until the FB deallocation by one of the fo= llowing ways: from FB allocation, if FB is allocated by demand from the dem= and Hit FB, if it is allocated by hardware or software prefetch. 
Note: In t= he L1D, a Demand Read contains cacheable or noncacheable demand loads, incl= uding ones causing cache-line splits and reads due to page walks resulted f= rom any request type.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with L1D load Misses outstanding.", + "Counter": "2", + "CounterMask": "1", + "EventCode": "0x48", + "EventName": "L1D_PENDING.LOAD_CYCLES", + "PublicDescription": "Counts duration of L1D miss outstanding in c= ycles.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache lines filling L2", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.ALL", + "PublicDescription": "Counts the number of L2 cache lines filling = the L2. Counting does not cover rejects.", + "SampleAfterValue": "100003", + "UMask": "0x1f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Exclusive state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.E", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Forward state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.F", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Invalid state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.I", + "PublicDescription": "Counts the number of cache lines filled into= the L2 cache that are in Invalid state, does not count lines that go Inval= id due to an eviction", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Modified state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.M", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cache lines filled into = the L2 cache that are in Shared state", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x25", + "EventName": "L2_LINES_IN.S", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = evicted due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= evicted due to an L2 cache fill. Increments on the core that brought the l= ine in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Modified cache lines that are evicted by L2 c= ache when triggered by an L2 cache fill.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.NON_SILENT", + "PublicDescription": "Counts the number of lines that are evicted = by L2 cache when triggered by an L2 cache fill. Those lines are in Modified= state. 
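The two L1D_PENDING events above combine into the usual occupancy ratios (Little's-law style, so approximate). A Python sketch with invented counter readings:

    l1d_pending_load = 1_200_000        # occupancy: outstanding misses summed per cycle
    l1d_miss_load = 40_000              # occurrences: demand loads that missed L1D
    l1d_pending_load_cycles = 300_000   # cycles with at least one miss outstanding

    avg_miss_latency = l1d_pending_load / l1d_miss_load                # ~30 cycles
    avg_misses_in_flight = l1d_pending_load / l1d_pending_load_cycles  # ~4.0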
Modified lines are written back to L3", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that are = silently dropped due to an L2 cache fill", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of L2 cache lines that are= silently dropped due to an L2 cache fill. Increments on the core that bro= ught the line in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.SILENT", + "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache. These lines are typically in Shared or Exclusive stat= e. A non-threaded event.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 cache lines that have= been L2 hardware prefetched but not used by demand accesses", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.USELESS_HWPF", + "PublicDescription": "Counts the number of L2 cache lines that hav= e been L2 hardware prefetched but not used by demand accesses. Increments = on the core that brought the line in originally.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cache lines that have been L2 hardware prefet= ched but not used by demand accesses", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x26", + "EventName": "L2_LINES_OUT.USELESS_HWPF", + "PublicDescription": "Counts the number of cache lines that have b= een prefetched by the L2 hardware prefetcher but not used by demand access = when evicted from the L2 cache", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by either the L2 Stream or AMP that were throttled due to Dynamic Prefetch= Throttling. The throttle requestor/source could be from the uncore/SOC or = the Dead Block Predictor. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.DPT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by the L2 Stream that were throttled due to Demand Throttle Prefetcher. DT= P Global Triggered with no Local Override. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.DTP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by the L2 Stream and not throttled by DTP due to local override. These pre= fetches may still be throttled due to another throttler mechanism besides D= TP. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.DTP_OVERRIDE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of L2 prefetches initiated = by either the L2 Stream or AMP that were throttled due to exceeding the XQ = threshold set by either XQ_THRESOLD_DTP or XQ_THRESHOLD. 
Counts on a per co= re basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x28", + "EventName": "L2_PREFETCHES_THROTTLED.XQ_THRESH", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of demand and prefetch tran= sactions that the External Queue (XQ) rejects due to a full or near full co= ndition.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x30", + "EventName": "L2_REJECT_XQ.ANY", + "PublicDescription": "Counts the number of demand and prefetch tra= nsactions that the External Queue (XQ) rejects due to a full or near full c= ondition which likely indicates back pressure from the IDI link. The XQ ma= y reject transactions from the L2Q (non-cacheable requests), BBL (L2 misses= ) and WOB (L2 write-back victims).", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of L2 Cache Accesses Counts= the total number of L2 Cache Accesses - sum of hits, misses, rejects fron= t door requests for CRd/DRd/RFO/ItoM/L2 Prefetches only, per core event", "Counter": "0,1,2,3,4,5,6,7", @@ -9,69 +260,672 @@ "UMask": "0x7", "Unit": "cpu_atom" }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_REQUEST.ALL", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_RQSTS.ANY]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that r= esulted in a Hit from a front door request only (does not include rejects o= r recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.HIT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of total L2 Cache Accesses = that resulted in a Miss from a front door request only (does not include re= jects or recycles), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_RQSTS.MISS]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_REQUEST.MISS", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_RQSTS.MISS]", + "SampleAfterValue": "200003", + "UMask": "0x3f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of L2 Cache Accesses that m= iss the L2 and get BBL reject short and long rejects (includes those count= ed in L2_reject_XQ.any), per core event", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x24", + "EventName": "L2_REQUEST.REJECTS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Demand Data Read access L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ALL_DEMAND_DATA_RD", + "PublicDescription": "Counts Demand Data Read requests accessing t= he L2 cache. These requests may hit or miss L2 cache. 
True-miss exclude mis= ses that were merged with ongoing L2 misses. An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0xe1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.ANY", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_RQSTS.REFERENCES, L2_REQUEST.ALL]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache misses when fetching instructions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.CODE_RD_MISS", + "PublicDescription": "Counts L2 cache misses when fetching instruc= tions.", + "SampleAfterValue": "200003", + "UMask": "0x24", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests that hit L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_HIT", + "PublicDescription": "Counts the number of demand Data Read reques= ts initiated by load instructions that hit L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.DEMAND_DATA_RD_MISS", + "PublicDescription": "Counts demand Data Read requests with true-m= iss in the L2 cache. True-miss excludes misses that were merged with ongoin= g L2 misses. An access is counted once.", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Read requests with true-miss in L2 cache [Thi= s event is alias to L2_REQUEST.MISS]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.MISS", + "PublicDescription": "Counts read requests of any type with true-m= iss in the L2 cache. True-miss excludes L2 misses that were merged with ong= oing L2 misses. [This event is alias to L2_REQUEST.MISS]", + "SampleAfterValue": "200003", + "UMask": "0x3f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All accesses to L2 cache [This event is alias= to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.REFERENCES", + "PublicDescription": "Counts all requests that were hit or true mi= sses in L2 cache. True-miss excludes misses that were merged with ongoing L= 2 misses. [This event is alias to L2_REQUEST.ALL,L2_RQSTS.ANY]", + "SampleAfterValue": "200003", + "UMask": "0xff", + "Unit": "cpu_core" + }, + { + "BriefDescription": "RFO requests that miss L2 cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x24", + "EventName": "L2_RQSTS.RFO_MISS", + "PublicDescription": "Counts the RFO (Read-for-Ownership) requests= that miss L2 cache.", + "SampleAfterValue": "200003", + "UMask": "0x22", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when L1D is locked", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x42", + "EventName": "LOCK_CYCLES.CACHE_LOCK_DURATION", + "PublicDescription": "This event counts the number of cycles when = the L1D is locked.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of cacheable memory request= s that miss in the LLC. 
Counts on a per core basis.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x2e", "EventName": "LONGEST_LAT_CACHE.MISS", - "PublicDescription": "Counts the number of cacheable memory reques= ts that miss in the Last Level Cache (LLC). Requests include demand loads, = reads for ownership (RFO), instruction fetches and L1 HW prefetches. If the= platform has an L3 cache, the LLC is the L3 cache, otherwise it is the L2 = cache. Counts on a per core basis.", + "PublicDescription": "Counts the number of cacheable memory reques= ts that miss in the Last Level Cache (LLC). Requests include demand loads, = reads for ownership (RFO), instruction fetches and L1 HW prefetches. If the= core has access to an L3 cache, the LLC is the L3 cache, otherwise it is t= he L2 cache. Counts on a per core basis.", "SampleAfterValue": "200003", "UMask": "0x41", "Unit": "cpu_atom" }, { - "BriefDescription": "Core-originated cacheable requests that misse= d L3 (Except hardware prefetches to the L3)", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x2e", - "EventName": "LONGEST_LAT_CACHE.MISS", - "PublicDescription": "Counts core-originated cacheable requests th= at miss the L3 cache (Longest Latency cache). Requests include data and cod= e reads, Reads-for-Ownership (RFOs), speculative accesses and hardware pref= etches to the L1 and L2. It does not include hardware prefetches to the L3= , and may not count other types of requests to the L3.", - "SampleAfterValue": "100003", - "UMask": "0x41", - "Unit": "cpu_core" + "BriefDescription": "Core-originated cacheable requests that misse= d L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.MISS", + "PublicDescription": "Counts core-originated cacheable requests th= at miss the L3 cache (Longest Latency cache). Requests include data and cod= e reads, Reads-for-Ownership (RFOs), speculative accesses and hardware pref= etches to the L1 and L2. It does not include hardware prefetches to the L3= , and may not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cacheable memory request= s that access the LLC. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts the number of cacheable memory reques= ts that access the Last Level Cache (LLC). Requests include demand loads, r= eads for ownership (RFO), instruction fetches and L1 HW prefetches. If the = core has access to an L3 cache, the LLC is the L3 cache, otherwise it is th= e L2 cache. Counts on a per core basis.", + "SampleAfterValue": "200003", + "UMask": "0x4f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core-originated cacheable requests that refer= to L3 (Except hardware prefetches to the L3)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2e", + "EventName": "LONGEST_LAT_CACHE.REFERENCE", + "PublicDescription": "Counts core-originated cacheable requests to= the L3 cache (Longest Latency cache). Requests include data and code reads= , Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches = to the L1 and L2. 
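With both LONGEST_LAT_CACHE.MISS and LONGEST_LAT_CACHE.REFERENCE spelled out for cpu_atom and cpu_core, the obvious derived ratio is a one-liner; readings invented for illustration:

    llc_refs = 5_000_000              # LONGEST_LAT_CACHE.REFERENCE
    llc_miss = 250_000                # LONGEST_LAT_CACHE.MISS
    miss_ratio = llc_miss / llc_refs  # 0.05: 5% of LLC accesses missed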
It does not include hardware prefetches to the L3, and m= ay not count other types of requests to the L3.", + "SampleAfterValue": "100003", + "UMask": "0x4f", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_HIT", + "PublicDescription": "Counts the number of cycles the core is stal= led due to an instruction cache or Translation Lookaside Buffer (TLB) miss = which hit in the L2 cache. Includes L2 Hit resulting from and L1D eviction= of another core in the same module which is longer latency than a typical = L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to an instruction cache or TLB miss which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x35", + "EventName": "MEM_BOUND_STALLS_IFETCH.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an L1 demand load miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_HIT", + "PublicDescription": "Counts the number of cycles a core is stalle= d due to a demand load which hit in the L2 cache. 
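The MEM_BOUND_STALLS_IFETCH umasks above compose by OR-ing disjoint sub-event bits, which can be sanity-checked directly from the JSON:

    ALL, L2_HIT, L2_MISS = 0x7F, 0x01, 0x7E  # umasks from the IFETCH entries
    assert L2_HIT | L2_MISS == ALL           # ALL is the union of both outcomes
    assert L2_HIT & L2_MISS == 0             # and the outcomes are disjoint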
Includes L2 Hit resultin= g from and L1D eviction of another core in the same module which is longer = latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a demand load which missed in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which missed all the local cache= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.LLC_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled to a store buffer full condition", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x34", + "EventName": "MEM_BOUND_STALLS_LOAD.SBFULL", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts all retired load instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_LOADS", + "PublicDescription": "Counts Instructions with at least one archit= ecturally visible load retired.", + "SampleAfterValue": "1000003", + "UMask": "0x81", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_STORES", + "PublicDescription": "Counts all retired store instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired software prefetch instructions.", + "Counter": "0,1,2,3", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ALL_SWPF", + "PublicDescription": "Counts all retired software prefetch instruc= tions.", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All retired memory instructions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.ANY", + "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", + "SampleAfterValue": "1000003", + "UMask": "0x87", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with locked access.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.LOCK_LOADS", + "PublicDescription": "Counts retired load instructions with locked= access.", + "SampleAfterValue": "100007", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that split across a= cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", + "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", + "SampleAfterValue": "100003", + "UMask": "0x41", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that split across = a cacheline boundary.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.SPLIT_STORES", + "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", + 
"SampleAfterValue": "100003", + "UMask": "0x42", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that hit the STLB.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", + "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that hit the STLB.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", + "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0xa", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions that miss the STLB.= ", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", + "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x11", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired store instructions that miss the STLB= .", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", + "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", + "SampleAfterValue": "100003", + "UMask": "0x12", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were a cross-core Snoop hits and forwards data from an in on-package core c= ache (induced by NI$)", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", + "PublicDescription": "Counts retired load instructions whose data = sources were a cross-core Snoop hits and forwards data from an in on-packag= e core cache (induced by NI$)", + "SampleAfterValue": "20011", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were HitM responses from shared L3, Hit-with-FWD is normally excluded.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM", + "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3, Hit-with-FWD is normally exclud= ed.", + "SampleAfterValue": "20011", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were L3 hit and cross-core snoop missed in on-pkg core cache.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", + "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions whose data sources = were L3 and cross-core snoop hits in on-pkg core cache", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd2", + "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", + "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", + "SampleAfterValue": "20011", + "UMask": "0x2", + "Unit": 
"cpu_core" + }, + { + "BriefDescription": "Retired load instructions which data source i= s memory side cache.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_L3_MISS_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "100007", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions with at least 1 uncachea= ble load or lock.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd4", + "EventName": "MEM_LOAD_MISC_RETIRED.UC", + "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", + "SampleAfterValue": "100007", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of completed demand load requests that= missed the L1, but hit the FB(fill buffer), because a preceding miss to th= e same cacheline initiated the line to be brought into L1, but data is not = yet ready in L1.", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.FB_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", + "SampleAfterValue": "100007", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L1 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. This event includes all SW prefet= ches and lock instructions regardless of the data source.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts retired load instructions with at leas= t one uop that hit in the Level 1 of the L1 data cache.", + "Counter": "0,1,2,3", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_HIT_L1", + "SampleAfterValue": "1000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L1 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L1_MISS", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", + "SampleAfterValue": "200003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L2 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_HIT", + "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L2 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L2_MISS", + "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", + "SampleAfterValue": "100021", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions with L3 cache hits = as data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_HIT", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 
cache.", + "SampleAfterValue": "100021", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired load instructions missed L3 cache as = data sources", + "Counter": "0,1,2,3", + "Data_LA": "1", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_RETIRED.L3_MISS", + "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", + "SampleAfterValue": "50021", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event is deprecated. Refer to new event = MEM_LOAD_UOPS_LLC_MISS_RETIRED.MEMSIDE_CACHE", + "Counter": "0,1,2,3,4,5,6,7", + "Deprecated": "1", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_L3_MISS_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss the L2 cache, missed the Memory Side Cache and hit in DRAM", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss the LLC cache and hit in the Memory Side Cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd3", + "EventName": "MEM_LOAD_UOPS_LLC_MISS_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L1 data cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L1_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that hi= t in the L2 cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_HIT", + "PublicDescription": "Counts the number of load ops retired that h= it in the L2 cache. 
Includes L2 Hit resulting from and L1D eviction of ano= ther core in the same module which is longer latency than a typical L2 hit.= ", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that mi= ss in the L2 cache", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.L2_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of loads that hit in a writ= e combining buffer (WCB), excluding the first load that caused the WCB to a= llocate.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xd1", + "EventName": "MEM_LOAD_UOPS_RETIRED.WCB_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked for any of the following reasons: load buffer, store buffer or RSV fu= ll.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to load buffer full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.LD_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to RSV full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.RSV", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cacheable memory request= s that access the LLC. Counts on a per core basis.", + "BriefDescription": "Counts the number of cycles that uops are blo= cked due to store buffer full", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x2e", - "EventName": "LONGEST_LAT_CACHE.REFERENCE", - "PublicDescription": "Counts the number of cacheable memory reques= ts that access the Last Level Cache (LLC). Requests include demand loads, r= eads for ownership (RFO), instruction fetches and L1 HW prefetches. If the = platform has an L3 cache, the LLC is the L3 cache, otherwise it is the L2 c= ache. Counts on a per core basis.", - "SampleAfterValue": "200003", - "UMask": "0x4f", + "EventCode": "0x04", + "EventName": "MEM_SCHEDULER_BLOCK.ST_BUF", + "SampleAfterValue": "1000003", + "UMask": "0x1", "Unit": "cpu_atom" }, { - "BriefDescription": "Core-originated cacheable requests that refer= to L3 (Except hardware prefetches to the L3)", + "BriefDescription": "MEM_STORE_RETIRED.L2_HIT", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x2e", - "EventName": "LONGEST_LAT_CACHE.REFERENCE", - "PublicDescription": "Counts core-originated cacheable requests to= the L3 cache (Longest Latency cache). Requests include data and code reads= , Reads-for-Ownership (RFOs), speculative accesses and hardware prefetches = to the L1 and L2. 
It does not include hardware prefetches to the L3, and m= ay not count other types of requests to the L3.", - "SampleAfterValue": "100003", - "UMask": "0x4f", + "EventCode": "0x44", + "EventName": "MEM_STORE_RETIRED.L2_HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Retired load instructions.", - "Counter": "0,1,2,3", - "Data_LA": "1", - "EventCode": "0xd0", - "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", - "PublicDescription": "Counts all retired load instructions. This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", - "SampleAfterValue": "1000003", - "UMask": "0x81", + "BriefDescription": "Number of cache-lines required by retired sto= res whose Data Source is: Memory Side Cache", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x44", + "EventName": "MEM_STORE_RETIRED.MEMSIDE_CACHE", + "SampleAfterValue": "100021", "Unit": "cpu_core" }, { - "BriefDescription": "Retired store instructions.", - "Counter": "0,1,2,3", + "BriefDescription": "Counts the number of memory uops retired. A = single uop that performs both a load AND a store will be counted as 1, not = 2 (e.g. ADD [mem], CONST)", + "Counter": "0,1,2,3,4,5,6,7", "Data_LA": "1", "EventCode": "0xd0", - "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", - "PublicDescription": "Counts all retired store instructions.", - "SampleAfterValue": "1000003", - "UMask": "0x82", - "Unit": "cpu_core" + "EventName": "MEM_UOPS_RETIRED.ALL", + "SampleAfterValue": "200003", + "UMask": "0x83", + "Unit": "cpu_atom" }, { "BriefDescription": "Counts the number of load uops retired.", @@ -79,7 +933,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_LOADS", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x81", "Unit": "cpu_atom" @@ -90,24 +943,10 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.ALL_STORES", - "PEBS": "1", "SampleAfterValue": "200003", "UMask": "0x82", "Unit": "cpu_atom" }, - { - "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", - "Counter": "0,1,2,3,4,5,6,7", - "Data_LA": "1", - "EventCode": "0xd0", - "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_1024", - "MSRIndex": "0x3F6", - "MSRValue": "0x400", - "PEBS": "2", - "SampleAfterValue": "200003", - "UMask": "0x5", - "Unit": "cpu_atom" - }, { "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", "Counter": "0,1,2,3,4,5,6,7", @@ -116,7 +955,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -129,20 +967,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", - "SampleAfterValue": "200003", - "UMask": "0x5", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of tagged load uops retired= that exceed the latency threshold defined in MEC_CR_PEBS_LD_LAT_THRESHOLD = - Only counts with PEBS enabled", - "Counter": "0,1,2,3,4,5,6,7", - "Data_LA": "1", - "EventCode": "0xd0", - "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_2048", - "MSRIndex": "0x3F6", - "MSRValue": "0x800", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -155,7 +979,6 @@ 
"EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -168,7 +991,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -181,7 +1003,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -194,7 +1015,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -207,7 +1027,6 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" @@ -220,20 +1039,373 @@ "EventName": "MEM_UOPS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x5", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of load uops retired that p= erformed one or more locks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.LOCK_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x21", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory renamed load uops= retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.MRN_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory renamed store uop= s retired.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.MRN_STORES", + "SampleAfterValue": "200003", + "UMask": "0xa", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory uops retired that= were splits.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x43", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split load uops.= ", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x41", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retired split store uops= .", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.SPLIT_STORES", + "SampleAfterValue": "200003", + "UMask": "0x42", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory uops retired that= missed in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load ops retired that fi= lled the STLB - includes those in DTLB_LOAD_MISSES submasks", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of store ops retired (store= 
STLB miss)", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of stores uops retired sam= e as MEM_UOPS_RETIRED.ALL_STORES", "Counter": "0,1,2,3,4,5,6,7", "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_UOPS_RETIRED.STORE_LATENCY", - "PEBS": "2", "SampleAfterValue": "200003", "UMask": "0x6", "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired memory uops for any access", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe5", + "EventName": "MEM_UOP_RETIRED.ANY", + "PublicDescription": "Number of retired micro-operations (uops) fo= r load or store memory accesses", + "SampleAfterValue": "1000003", + "UMask": "0xf", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by mem side cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.MEMSIDE_CACHE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x11F80000004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches, data forwarding i= s required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache where a snoop hit in another cores caches which forwarded th= e unmodified data to the requesting core.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x20001E00001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y mem side cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.MEMSIDE_CACHE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x11F80000001", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by the L3 cache where a snoop hit in another cores caches, data forw= arding is required as the data is modified.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x40001E00002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Any memory transaction that reached the SQ.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.ALL_REQUESTS", + "PublicDescription": "Counts memory transactions reached the super= queue including requests initiated by the core, all L3 prefetches, page wa= lks, etc..", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand and prefetch data reads", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DATA_RD", + 
"PublicDescription": "Counts the demand and prefetch data reads. A= ll Core Data Reads include cacheable 'Demands' and L2 prefetchers (not L3 p= refetchers). Counting also covers reads due to page walks resulted from any= request type.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cacheable and Non-Cacheable code read request= s", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_CODE_RD", + "PublicDescription": "Counts both cacheable and Non-Cacheable code= read requests.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand Data Read requests sent to uncore", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_DATA_RD", + "PublicDescription": "Counts the Demand Data Read requests sent to= uncore. Use it in conjunction with OFFCORE_REQUESTS_OUTSTANDING to determi= ne average latency in the uncore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Demand RFO requests including regular RFOs, l= ocks, ItoM", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": "OFFCORE_REQUESTS.DEMAND_RFO", + "PublicDescription": "Counts the demand RFO (read for ownership) r= equests including regular RFOs, locks, ItoM.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when offcore outstanding cacheable Cor= e Data Read transactions are present in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "PublicDescription": "Counts cycles when offcore outstanding cache= able Core Data Read transactions are present in the super queue. A transact= ion is considered to be in the Offcore outstanding state between L2 miss an= d transaction completion sent to requestor (SQ de-allocation). See correspo= nding Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding Code Reads tr= ansactions in the SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE= _RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). 
See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where at least 1 outstanding demand da= ta read request is pending.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_DATA= _RD", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles with offcore outstanding demand rfo re= ads transactions in SuperQueue (SQ), queue to uncore.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO", + "PublicDescription": "Counts the number of offcore outstanding dem= and rfo Reads transactions in the super queue every cycle. The 'Offcore out= standing' state of the transaction lasts from the L2 miss until the sending= transaction completion to requestor (SQ deallocation). See the correspondi= ng Umask under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore outstanding Code Reads transactions i= n the SuperQueue (SQ), queue to uncore, every cycle.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_CODE_RD", + "PublicDescription": "Counts the number of offcore outstanding Cod= e Reads transactions in the super queue every cycle. The 'Offcore outstandi= ng' state of the transaction lasts from the L2 miss until the sending trans= action completion to requestor (SQ deallocation). See the corresponding Uma= sk under OFFCORE_REQUESTS.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "For every cycle, increments by the number of = outstanding demand data read requests pending.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", + "PublicDescription": "For every cycle, increments by the number of= outstanding demand data read requests pending. Requests are considered o= utstanding from the time they miss the core's L2 cache until the transactio= n completion message is sent to the requestor.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Store Read transactions pending for off-core.= Highly correlated.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_RFO", + "PublicDescription": "Counts the number of off-core outstanding re= ad-for-ownership (RFO) store transactions every cycle. An RFO transaction i= s considered to be in the Off-core outstanding state between L2 cache miss = and transaction completion.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts bus locks, accounts for cache line spl= it locks and UC locks.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x2c", + "EventName": "SQ_MISC.BUS_LOCK", + "PublicDescription": "Counts the more expensive bus lock needed to= enforce cache coherency for certain memory accesses that need to be done a= tomically. 
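Every file in this series uses the same flat schema, so a generated file can be eyeballed with a few lines of Python; the path below is illustrative, not a claim about which file the hunk above belongs to:

# Sketch: count events per PMU unit in one pmu-events JSON file.
import collections
import json

with open("tools/perf/pmu-events/arch/x86/lunarlake/cache.json") as f:
    events = json.load(f)
print(collections.Counter(e.get("Unit", "?") for e in events))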
diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json b/tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json
new file mode 100644
index 000000000000..d436e0c13e7e
--- /dev/null
+++ b/tools/perf/pmu-events/arch/x86/lunarlake/floating-point.json
@@ -0,0 +1,484 @@
+[
+    {
+        "BriefDescription": "Counts the number of cycles when any of the floating point dividers are active.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles when floating-point divide unit is busy executing divide or square root operations.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0xb0",
+        "EventName": "ARITH.FPDIV_ACTIVE",
+        "PublicDescription": "Counts cycles when divide unit is busy executing divide or square root operations. Accounts for floating-point operations only.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of active floating point dividers per cycle in the loop stage.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_OCCUPANCY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point divider uops executed per cycle.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xcd",
+        "EventName": "ARITH.FPDIV_UOPS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts all microcode FP assists.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.FP",
+        "PublicDescription": "Counts all microcode Floating Point assists.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "ASSISTS.SSE_AVX_MIX",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.SSE_AVX_MIX",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 1st VEC port (port 0). FP-arith-uops are of type ADD* / SUB* / MUL / FMA* / DPP.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V0",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 2nd VEC port (port 1)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V1",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 3rd VEC port (port 5)",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V2",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of FP-arith-uops dispatched on 4th VEC port",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xb3",
+        "EventName": "FP_ARITH_DISPATCHED.V3",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.4_FLOPS",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.4_FLOPS",
+        "SampleAfterValue": "100003",
+        "UMask": "0x18",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.SCALAR",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.SCALAR",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.SCALAR_DOUBLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.SCALAR_SINGLE",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.SCALAR_SINGLE",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This event is deprecated. Refer to new event FP_ARITH_OPS_RETIRED.VECTOR",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "Deprecated": "1",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.VECTOR",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3c",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_128B [This event is alias to FP_ARITH_OPS_RETIRED.VECTOR_128B]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.VECTOR_128B",
+        "SampleAfterValue": "100003",
+        "UMask": "0xc",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "FP_ARITH_INST_RETIRED.VECTOR_256B [This event is alias to FP_ARITH_OPS_RETIRED.VECTOR_256B]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_INST_RETIRED.VECTOR_256B",
+        "SampleAfterValue": "100003",
+        "UMask": "0x30",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational 128-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.128B_PACKED_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 4 computation operations, one for each element. Applies to SSE* and AVX* packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.256B_PACKED_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational 256-bit packed single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 8 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX SQRT RSQRT RCP DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of SSE/AVX computational 128-bit packed single and 256-bit packed double precision FP instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, 1 for each element. Applies to SSE* and AVX* packed single precision and packed double precision FP instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.4_FLOPS",
+        "PublicDescription": "Number of SSE/AVX computational 128-bit packed single precision and 256-bit packed double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 2 or/and 4 computation operations, one for each element. Applies to SSE* and AVX* packed single precision floating-point and packed double precision floating-point instructions: ADD SUB HADD HSUB SUBADD MUL DIV MIN MAX RCP14 RSQRT14 SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x18",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of SSE/AVX computational scalar floating-point instructions retired; some instructions will count twice as noted below. Applies to SSE* and AVX* scalar, double and single precision floating-point: ADD SUB MUL DIV MIN MAX RCP14 RSQRT14 RANGE SQRT DPP FM(N)ADD/SUB. DPP and FM(N)ADD/SUB instructions count twice as they perform multiple calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.SCALAR",
+        "PublicDescription": "Number of SSE/AVX computational scalar single precision and double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_DOUBLE",
+        "PublicDescription": "Number of SSE/AVX computational scalar double precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar double precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.SCALAR_SINGLE",
+        "PublicDescription": "Number of SSE/AVX computational scalar single precision floating-point instructions retired; some instructions will count twice as noted below. Each count represents 1 computational operation. Applies to SSE* and AVX* scalar single precision floating-point instructions: ADD SUB MUL DIV MIN MAX SQRT RSQRT RCP FM(N)ADD/SUB. FM(N)ADD/SUB instructions count twice as they perform 2 calculations per element. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Number of any Vector retired FP arithmetic instructions",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.VECTOR",
+        "PublicDescription": "Number of any Vector retired FP arithmetic instructions. The DAZ and FTZ flags in the MXCSR register need to be set when using these events.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3c",
+        "Unit": "cpu_core"
+    },
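The per-count operation weights spelled out in the descriptions above (1 per scalar count, 2 per 128-bit DP count, 4 per 128-bit SP or 256-bit DP count, 8 per 256-bit SP count) are what a FLOPs metric would apply. A minimal Python sketch with hypothetical counter readings:

# Sketch: weight FP_ARITH_OPS_RETIRED.* counts into total FP operations.
WEIGHTS = {
    "SCALAR": 1,              # 1 op per count
    "128B_PACKED_DOUBLE": 2,  # 2 ops per count
    "128B_PACKED_SINGLE": 4,  # 4 ops per count
    "256B_PACKED_DOUBLE": 4,  # 4 ops per count
    "256B_PACKED_SINGLE": 8,  # 8 ops per count
}

def total_fp_ops(counts: dict) -> int:
    return sum(WEIGHTS[name] * value for name, value in counts.items())

assert total_fp_ops({"SCALAR": 10, "256B_PACKED_SINGLE": 5}) == 50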
+    {
+        "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_128B [This event is alias to FP_ARITH_INST_RETIRED.VECTOR_128B]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_128B",
+        "SampleAfterValue": "100003",
+        "UMask": "0xc",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "FP_ARITH_OPS_RETIRED.VECTOR_256B [This event is alias to FP_ARITH_INST_RETIRED.VECTOR_256B]",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc7",
+        "EventName": "FP_ARITH_OPS_RETIRED.VECTOR_256B",
+        "SampleAfterValue": "100003",
+        "UMask": "0x30",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of all types of floating point operations per uop with all default weighting",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc8",
+        "EventName": "FP_FLOPS_RETIRED.ALL",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point operations that produce 32 bit single precision results",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc8",
+        "EventName": "FP_FLOPS_RETIRED.FP32",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point operations that produce 64 bit double precision results",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc8",
+        "EventName": "FP_FLOPS_RETIRED.FP64",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of retired instructions whose sources are a packed 128 bit double precision floating point. This may be SSE or AVX.128 operations.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc7",
+        "EventName": "FP_INST_RETIRED.128B_DP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of retired instructions whose sources are a packed 128 bit single precision floating point. This may be SSE or AVX.128 operations.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc7",
+        "EventName": "FP_INST_RETIRED.128B_SP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of retired instructions whose sources are a packed 256 bit double precision floating point.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc7",
+        "EventName": "FP_INST_RETIRED.256B_DP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of retired instructions whose sources are a packed 256 bit single precision floating point.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc7",
+        "EventName": "FP_INST_RETIRED.256B_SP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 32bit single precision floating point",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc7",
+        "EventName": "FP_INST_RETIRED.32B_SP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of retired instructions whose sources are a scalar 64 bit double precision floating point",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc7",
+        "EventName": "FP_INST_RETIRED.64B_DP",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the total number of floating point retired instructions.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc7",
+        "EventName": "FP_INST_RETIRED.ALL",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x3f",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on all floating point ports.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb2",
+        "EventName": "FP_VINT_UOPS_EXECUTED.ALL",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1f",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 0.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb2",
+        "EventName": "FP_VINT_UOPS_EXECUTED.P0",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 1.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb2",
+        "EventName": "FP_VINT_UOPS_EXECUTED.P1",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 2.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb2",
+        "EventName": "FP_VINT_UOPS_EXECUTED.P2",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 3.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb2",
+        "EventName": "FP_VINT_UOPS_EXECUTED.P3",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on floating point and vector integer port 0, 1, 2, 3.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb2",
+        "EventName": "FP_VINT_UOPS_EXECUTED.PRIMARY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1e",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of uops executed on floating point and vector integer store data port.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xb2",
+        "EventName": "FP_VINT_UOPS_EXECUTED.STD",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point operations retired that required microcode assist.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc3",
+        "EventName": "MACHINE_CLEARS.FP_ASSIST",
+        "PublicDescription": "Counts the number of floating point operations retired that required microcode assist, which is not a reflection of the number of FP operations, instructions or uops.",
+        "SampleAfterValue": "20003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of floating point divide uops retired (x87 and sse, including x87 sqrt)",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc2",
+        "EventName": "UOPS_RETIRED.FPDIV",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    }
+]
integer store data port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb2", + "EventName": "FP_VINT_UOPS_EXECUTED.STD", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point operation= s retired that required microcode assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.FP_ASSIST", + "PublicDescription": "Counts the number of floating point operatio= ns retired that required microcode assist, which is not a reflection of the= number of FP operations, instructions or uops.", + "SampleAfterValue": "20003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point divide uo= ps retired (x87 and sse, including x87 sqrt)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.FPDIV", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_atom" + } +] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/frontend.json b/tools= /perf/pmu-events/arch/x86/lunarlake/frontend.json index 0327bece0f94..07bd38a1904e 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/frontend.json @@ -1,4 +1,446 @@ [ + { + "BriefDescription": "Counts the total number of BACLEARS due to al= l branch types including conditional and unconditional jumps, returns, and = indirect branches.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Counts the total number of BACLEARS, which o= ccur when the Branch Target Buffer (BTB) prediction or lack thereof, was co= rrected by a later branch predictor in the frontend. Includes BACLEARS due= to all branch types including conditional and unconditional jumps, returns= , and indirect branches.", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Clears due to Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x60", + "EventName": "BACLEARS.ANY", + "PublicDescription": "Number of times the front-end is resteered w= hen it finds a branch instruction in a fetch line. 
This is called Unknown B= ranch which occurs for the first time a branch instruction is fetched or wh= en the branch is not tracked by the BPU (Branch Prediction Unit) anymore.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of BACLEARS due to a condit= ional jump.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.COND", + "SampleAfterValue": "200003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of BACLEARS due to an indir= ect branch.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of BACLEARS due to a return= branch.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.RETURN", + "SampleAfterValue": "200003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of BACLEARS due to a direct= , unconditional jump.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe6", + "EventName": "BACLEARS.UNCOND", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Stalls caused by changing prefix length of th= e instruction.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.LCP", + "PublicDescription": "Counts cycles that the Instruction Length de= coder (ILD) stalls occurred due to dynamically changing prefix length of th= e decoded instruction (by operand size prefix instruction 0x66, address siz= e prefix instruction 0x67 or REX.W for Intel64). Count is proportional to t= he number of prefixes in a 16B-line. This may result in a three-cycle penal= ty for each LCP (Length changing prefix) in a 16-byte chunk.", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles the Microcode Sequencer is busy.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x87", + "EventName": "DECODE.MS_BUSY", + "SampleAfterValue": "500009", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of times a decode restricti= on reduces the decode throughput due to wrong instruction length prediction= .", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe9", + "EventName": "DECODE_RESTRICTION.PREDECODE_WRONG", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "DSB-to-MITE switch true penalty cycles.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x61", + "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", + "PublicDescription": "Decode Stream Buffer (DSB) is a Uop-cache th= at holds translations of previously fetched instructions that were decoded = by the legacy x86 decode pipeline (MITE). 
This event counts fetch penalty c= ycles when a transition occurs from DSB to MITE.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged with having preceded with frontend bound behavior", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ALL", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "Always Not Taken (ANT) conditional retired b= ranches (no BTB entry and not mispredicted)", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced DSB miss= .", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x1", + "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instruction retired that= are tagged after a branch instruction causes bubbles/empty issue slots due= to a baclear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instruction retired that= are tagged after a branch instruction causes bubbles /empty issue slots du= e to a btclear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged following an ms flow due to the bubble/wasted issue slot from= exiting long ms flow", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.CISC", + "PublicDescription": "Counts the number of instructions retired t= hat were tagged following an ms flow due to the bubble/wasted issue slot fr= om exiting long ms flow", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged every cycle the decoder is unable to send 4 uops", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired Instructions who experienced a critic= al DSB miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.DSB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x11", + "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. 
Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ica= che miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ICACHE", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ITL= B miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired Instructions who experienced iTLB tru= e miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.ITLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x14", + "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced Instruct= ion L1 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L1I_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x12", + "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired Instructions who experienced Instruct= ion L2 Cache true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.L2_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x13", + "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 128 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", + "MSRIndex": "0x3F7", + "MSRValue": "0x608006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 16 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", + "MSRIndex": "0x3F7", + "MSRValue": "0x601006", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. 
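
All of the FRONTEND_RETIRED.* cpu_core entries in this block share EventCode
0xc6 and are distinguished only by the value programmed into MSR 0x3F7. A
small decoding sketch, assuming the MSR_PEBS_FRONTEND layout documented in
the Intel SDM (event select in bits 7:0, IDQ bubble length in bits 19:8,
bubble width in bits 22:20); treat the exact field positions as an
assumption to verify against the SDM for this part:

  # Decode a FRONTEND_RETIRED MSRValue, assuming the MSR_PEBS_FRONTEND
  # field layout: bits 7:0 event select, 19:8 IDQ bubble length
  # (the latency threshold), 22:20 IDQ bubble width.
  def decode_pebs_frontend(msr_value: int) -> dict:
      return {
          "event_select": msr_value & 0xFF,
          "bubble_length": (msr_value >> 8) & 0xFFF,
          "bubble_width": (msr_value >> 20) & 0x7,
      }

  # LATENCY_GE_16 uses 0x601006 -> length 16; LATENCY_GE_128 uses
  # 0x608006 -> length 128, consistent with the event names above.
  print(decode_pebs_frontend(0x601006))
  print(decode_pebs_frontend(0x608006))
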
During th= is period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions after front-end starvati= on of at least 2 cycles", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", + "MSRIndex": "0x3F7", + "MSRValue": "0x600206", + "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 256 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", + "MSRIndex": "0x3F7", + "MSRValue": "0x610006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end had at least 1 bubble-slot for a period of 2= cycles which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", + "MSRIndex": "0x3F7", + "MSRValue": "0x100206", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 32 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", + "MSRIndex": "0x3F7", + "MSRValue": "0x602006", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. 
During th= is period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 4 cycles w= hich was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", + "MSRIndex": "0x3F7", + "MSRValue": "0x600406", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 512 cycles= which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", + "MSRIndex": "0x3F7", + "MSRValue": "0x620006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 64 cycles = which was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", + "MSRIndex": "0x3F7", + "MSRValue": "0x604006", + "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that are fetched after a= n interval where the front-end delivered no uops for a period of 8 cycles w= hich was not interrupted by a back-end stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", + "MSRIndex": "0x3F7", + "MSRValue": "0x600806", + "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. 
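
Because each LATENCY_GE_N event counts stalls of at least N cycles, the
family is cumulative: everything counted by LATENCY_GE_32 is also counted by
LATENCY_GE_16. A per-bucket histogram therefore falls out of subtracting
adjacent thresholds; a sketch with hypothetical sample counts:

  # LATENCY_GE_N counts are cumulative, so subtract adjacent thresholds
  # to get a [N, 2N) histogram. Counts are hypothetical placeholders.
  ge = {2: 9000, 4: 7000, 8: 5200, 16: 3100, 32: 1800,
        64: 900, 128: 300, 256: 80, 512: 10}
  thresholds = sorted(ge)
  for lo, hi in zip(thresholds, thresholds[1:]):
      print(f"[{lo}, {hi}) cycle stalls: {ge[lo] - ge[hi]}")
  print(f"[{thresholds[-1]}, inf) cycle stalls: {ge[thresholds[-1]]}")
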
During thi= s period the front-end delivered no uops.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted Retired ANT branches", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MISP_ANT", + "MSRIndex": "0x3F7", + "MSRValue": "0x9", + "PublicDescription": "ANT retired branches that got just mispredic= ted", + "SampleAfterValue": "100007", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts flows delivered by the Microcode Seque= ncer", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.MS_FLOWS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instruction retired tagg= ed after a wasted issue slot if none of the previous events occurred", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instruction retired that= are tagged after a branch instruction causes bubbles/empty issue slots due= to a predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Retired Instructions who experienced STLB (2n= d level TLB) true miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.STLB_MISS", + "MSRIndex": "0x3F7", + "MSRValue": "0x15", + "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired instructions that caused clears due t= o being Unknown Branches.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc6", + "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", + "MSRIndex": "0x3F7", + "MSRValue": "0x17", + "PublicDescription": "Number retired branch instructions that caus= ed the front-end to be resteered when it finds the instruction in a fetch l= ine. This is called Unknown Branch which occurs for the first time a branch= instruction is fetched or when the branch is not tracked by the BPU (Branc= h Prediction Unit) anymore.", + "SampleAfterValue": "100007", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to Ins= truction L1 cache miss, that hit in the L2 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ICACHE_L2_HIT", + "PublicDescription": "Counts the number of instructions retired th= at were tagged because empty issue slots were seen before the uop due to In= struction L1 cache miss, that hit in the L2 cache. 
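
On the cpu_atom side, the FRONTEND_RETIRED.* events above tag retired
instructions with the front-end condition that preceded them, so each cause
can be read roughly as a share of FRONTEND_RETIRED.ALL (the per-cause umasks
look like disjoint selections of the same event, but treat that as an
assumption). A sketch with hypothetical counts:

  # Rough per-cause share of front-end-tagged retirement on the atom core.
  # Counts are hypothetical placeholders.
  all_tagged = 2_400_000       # FRONTEND_RETIRED.ALL
  causes = {                   # per-cause FRONTEND_RETIRED.* counts
      "ICACHE": 700_000,
      "ITLB_MISS": 150_000,
      "BRANCH_DETECT": 300_000,
      "DECODE": 550_000,
  }
  for name, count in causes.items():
      print(f"{name:>13}: {count / all_tagged:.1%} of tagged instructions")
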
Includes L2 Hit resulti= ng from and L1D eviction of another core in the same module which is longer= latency than a typical L2 hit.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ITL= B miss that hit in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of instructions retired tha= t were tagged because empty issue slots were seen before the uop due to ITL= B miss that also missed the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc9", + "EventName": "FRONTEND_RETIRED_SOURCE.ITLB_STLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts every time the code stream enters into= a new cache line by walking sequential from the previous line or being red= irected by a jump.", "Counter": "0,1,2,3,4,5,6,7", @@ -8,6 +450,15 @@ "UMask": "0x3", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts every time the code stream enters into= a new cache line by walking sequential from the previous line or being red= irected by a jump and the instruction cache registers bytes are present.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x80", + "EventName": "ICACHE.HIT", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts every time the code stream enters into= a new cache line by walking sequential from the previous line or being red= irected by a jump and the instruction cache registers bytes are not present= . -", "Counter": "0,1,2,3,4,5,6,7", @@ -18,13 +469,212 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were no operation was delivered to the back-end pipeline due = to instruction fetch limitations when the back-end could have accepted more= operations. Common examples include instruction cache misses or x86 instru= ction decode limitations.", + "BriefDescription": "Cycles where a code fetch is stalled due to L= 1 instruction cache miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALLS", + "PublicDescription": "Counts cycles where a code line fetch is sta= lled due to an L1 instruction cache miss. 
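
The ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS pair defined in this hunk
lends itself to a simple derived view: the first counts stalled cycles,
while the second (CounterMask plus EdgeDetect) counts distinct stall
episodes, so their ratio approximates the average episode length. A sketch,
with hypothetical counts:

  # Average L1I-miss stall episode: stalled cycles divided by the number
  # of edge-detected stall periods. Counts are hypothetical placeholders.
  stall_cycles = 36_000_000     # ICACHE_DATA.STALLS
  stall_periods = 1_200_000     # ICACHE_DATA.STALL_PERIODS
  total_cycles = 4_000_000_000  # CPU_CLK_UNHALTED.THREAD
  print(f"code-fetch stall fraction: {stall_cycles / total_cycles:.2%}")
  print(f"avg stall length: {stall_cycles / stall_periods:.1f} cycles")
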
The decode pipeline works at a 32= Byte granularity.", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "ICACHE_DATA.STALL_PERIODS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x80", + "EventName": "ICACHE_DATA.STALL_PERIODS", + "SampleAfterValue": "500009", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where a code fetch is stalled due to L= 1 instruction cache tag miss.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x83", + "EventName": "ICACHE_TAG.STALLS", + "PublicDescription": "Counts cycles where a code fetch is stalled = due to L1 instruction cache tag miss.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Decode Stream Buffer (DSB) is deliveri= ng any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delive= red to Instruction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) p= ath.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles DSB is delivering optimal number of Uo= ps", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.DSB_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where optimal nu= mber of uops was delivered to the Instruction Decode Queue (IDQ) from the D= SB (Decode Stream Buffer) path. Count includes uops that may 'bypass' the I= DQ.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) from the Decode Stream Buffer (DSB) path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.DSB_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instr= uction Decode Queue (IDQ) from the Decode Stream Buffer (DSB) path.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering any Uop", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_ANY", + "PublicDescription": "Counts the number of cycles uops were delive= red to the Instruction Decode Queue (IDQ) from the MITE (legacy decode pipe= line) path. During these cycles uops are not being delivered from the Decod= e Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles MITE is delivering optimal number of U= ops", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x79", + "EventName": "IDQ.MITE_CYCLES_OK", + "PublicDescription": "Counts the number of cycles where optimal nu= mber of uops was delivered to the Instruction Decode Queue (IDQ) from the M= ITE (legacy decode pipeline) path. During these cycles uops are not being d= elivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops delivered to Instruction Decode Queue (I= DQ) from MITE path", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MITE_UOPS", + "PublicDescription": "Counts the number of uops delivered to Instr= uction Decode Queue (IDQ) from the MITE path. 
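
The CYCLES_ANY/CYCLES_OK pairs above differ only in CounterMask (1 versus 8,
the optimal per-cycle uop delivery), so their ratio shows how often a fetch
path runs at full bandwidth while active. A sketch with hypothetical counts:

  # Delivery quality per fetch path: cycles at optimal bandwidth
  # (CounterMask=8) over cycles active at all (CounterMask=1).
  # Counts are hypothetical placeholders.
  dsb_any, dsb_ok = 2_500_000_000, 1_900_000_000   # IDQ.DSB_CYCLES_ANY/_OK
  mite_any, mite_ok = 600_000_000, 120_000_000     # IDQ.MITE_CYCLES_ANY/_OK
  for path, any_c, ok_c in (("DSB", dsb_any, dsb_ok),
                            ("MITE", mite_any, mite_ok)):
      print(f"{path}: optimal delivery in {ok_c / any_c:.1%} of active cycles")
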
This also means that uops are= not being delivered from the Decode Stream Buffer (DSB).", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when uops are being delivered to IDQ w= hile MS is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_CYCLES_ANY", + "PublicDescription": "Counts cycles during which uops are being de= livered to Instruction Decode Queue (IDQ) while the Microcode Sequencer (MS= ) is busy. Uops maybe initiated by Decode Stream Buffer (DSB) or MITE.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of switches from DSB or MITE to the MS= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0x79", + "EventName": "IDQ.MS_SWITCHES", + "PublicDescription": "Number of switches from DSB (Decode Stream B= uffer) or MITE (legacy decode pipeline) to the Microcode Sequencer.", + "SampleAfterValue": "100003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops initiated by MITE or Decode Stream Buffe= r (DSB) and delivered to Instruction Decode Queue (IDQ) while Microcode Seq= uencer (MS) is busy", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x79", + "EventName": "IDQ.MS_UOPS", + "PublicDescription": "Counts the number of uops initiated by MITE = or Decode Stream Buffer (DSB) and delivered to Instruction Decode Queue (ID= Q) while the Microcode Sequencer (MS) is busy. Counting includes uops that = may 'bypass' the IDQ.", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that when no operation was delivered to the back-end pipeline due = to instruction fetch limitations when the back-end could have accepted more= operations. Common examples include instruction cache misses or x86 instru= ction decode limitations.", "Counter": "0,1,2,3,4,5,6,7,8,9", "EventCode": "0x9c", "EventName": "IDQ_BUBBLES.CORE", - "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that were no operation was delivered to the back-end pipeline due= to instruction fetch limitations when the back-end could have accepted mor= e operations. Common examples include instruction cache misses or x86 instr= uction decode limitations. Software can use this event as the numerator for= the Frontend Bound metric (or top-level category) of the Top-down Microarc= hitecture Analysis method.", + "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that when no operation was delivered to the back-end pipeline due= to instruction fetch limitations when the back-end could have accepted mor= e operations. Common examples include instruction cache misses or x86 instr= uction decode limitations. Software can use this event as the numerator for= the Frontend Bound metric (or top-level category) of the Top-down Microarc= hitecture Analysis method.", "SampleAfterValue": "1000003", "UMask": "0x1", "Unit": "cpu_core" + }, + { + "BriefDescription": "This event is deprecated. 
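
Following the guidance in the IDQ_BUBBLES.CORE description just above (use
it as the numerator for the Frontend Bound metric), the level-1 computation
is one division. The slots denominator below is a hypothetical placeholder
for the Topdown Slots count (TOPDOWN.SLOTS on recent big cores; treat the
exact event name as an assumption):

  # Top-down level 1 Frontend Bound, per the IDQ_BUBBLES.CORE description:
  # bubble slots over total pipeline slots. Hypothetical counts.
  idq_bubbles_core = 1_800_000_000
  topdown_slots = 24_000_000_000   # slots event, e.g. TOPDOWN.SLOTS
  print(f"Frontend Bound: {idq_bubbles_core / topdown_slots:.1%}")
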
[This event is alia= s to IDQ_BUBBLES.STARVATION_CYCLES]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "Deprecated": "1", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when optimal number of uops was delive= red to the back-end when the back-end is not stalled", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.CYCLES_FE_WAS_OK", + "Invert": "1", + "PublicDescription": "Counts the number of cycles when the optimal= number of uops were delivered by the Instruction Decode Queue (IDQ) to the= back-end of the pipeline when there was no back-end stalls.", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when no uops are delivered by the IDQ = for 2 or more cycles when backend of the machine is not stalled - normally = indicating a Fetch Latency issue", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.FETCH_LATENCY", + "PublicDescription": "Counts the number of cycles when no uops wer= e delivered by the Instruction Decode Queue (IDQ) to the back-end of the pi= peline when there was no back-end stalls for 2 or more cycles - normally in= dicating a Fetch Latency issue.", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when no uops are not delivered by the = IDQ when backend of the machine is not stalled [This event is alias to IDQ_= BUBBLES.CYCLES_0_UOPS_DELIV.CORE]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0x9c", + "EventName": "IDQ_BUBBLES.STARVATION_CYCLES", + "PublicDescription": "Counts the number of cycles when no uops wer= e delivered by the Instruction Decode Queue (IDQ) to the back-end of the pi= peline when there was no back-end stalls. [This event is alias to IDQ_BUBBL= ES.CYCLES_0_UOPS_DELIV.CORE]", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles that the micro-se= quencer is busy.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.MS_BUSY", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times entered into a uco= de flow in the FEC. 
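
The MS_DECODED.* events just above make the atom core's microcode activity
easy to summarize: busy cycles as a fraction of core clocks, and entries per
kilo-instruction. A sketch, all counts hypothetical placeholders:

  # Microcode sequencer pressure on the atom core. Hypothetical counts.
  ms_busy_cycles = 55_000_000    # MS_DECODED.MS_BUSY
  ms_entries = 4_000_000         # MS_DECODED.MS_ENTRY
  core_cycles = 3_000_000_000    # CPU_CLK_UNHALTED.CORE
  instructions = 5_000_000_000   # INST_RETIRED.ANY
  print(f"MS busy: {ms_busy_cycles / core_cycles:.2%} of cycles")
  print(f"MS entries per kilo-instruction: "
        f"{1e3 * ms_entries / instructions:.2f}")
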
Includes inserted flows due to front-end detected faul= ts or assists.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.MS_ENTRY", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times nanocode flow is e= xecuted.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.NANO_CODE", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json b/to= ols/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json new file mode 100644 index 000000000000..bbacc9cf9df5 --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/lunarlake/lnl-metrics.json @@ -0,0 +1,4611 @@ +[ + { + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C1 residency percent per core", + "MetricExpr": "cstate_core@c1\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C1_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C2_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per package", + "MetricExpr": "cstate_pkg@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C7 residency percent per package", + "MetricExpr": "cstate_pkg@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C9 residency percent per package", + "MetricExpr": "cstate_pkg@c9\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C9_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Percentage of cycles spent in System Manageme= nt Interrupts.", + "MetricExpr": "((msr@aperf@ - cycles) / msr@aperf@ if msr@smi@ > 0= else 0)", + "MetricGroup": "smi", + "MetricName": "smi_cycles", + "MetricThreshold": "smi_cycles > 0.1", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Number of SMI interrupts.", + "MetricExpr": "msr@smi@", + "MetricGroup": "smi", + "MetricName": "smi_num", + "ScaleUnit": "1SMI#" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": 
"TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricName": "tma_allocation_restriction", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_mi= crocode_sequencer_group", + "MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_backend_bound", + "MetricThreshold": "tma_backend_bound > 0.2", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. 
Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_bad_speculation", + "MetricThreshold": "tma_bad_speculation > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;Memo= ryTLB;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
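
tma_bottleneck_branching_overhead above is plain slot accounting: branches,
plus twice the near calls, plus NOPs, scaled to a percentage of all slots.
Worked through with hypothetical counts:

  # Branching overhead per the MetricExpr above: control-flow and NOP
  # slots as a percentage of all slots. Hypothetical counts.
  all_branches = 900_000_000   # BR_INST_RETIRED.ALL_BRANCHES
  near_calls = 40_000_000      # BR_INST_RETIRED.NEAR_CALL
  nops = 120_000_000           # INST_RETIRED.NOP
  slots = 24_000_000_000       # tma_info_thread_slots
  overhead = 100 * (all_branches + 2 * near_calls + nops) / slots
  print(f"tma_bottleneck_branching_overhead: {overhead:.1f}")
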
(A lower bound)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_= split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + t= ma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bo= und + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_c= apacity / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + = tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)= ) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l= 3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtl= b_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_cap= acity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bou= nd * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_= fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_la= tency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bou= nd / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_sto= re_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + t= ma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_boun= d * (tma_store_bound / 
(tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tm= a_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)= ))", + "MetricGroup": "BvML;Default;Mem;MemoryLat;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_gro= up;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (t= ma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_res= teers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispr= edicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_bra= nches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_= ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / = (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;Top= downL1;tma_L1_group", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@U= OPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + = tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma= _other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + = tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itl= b_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switch= 
es) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms= )) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / t= ma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RES= OURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_s= erializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (= tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tm= a_L1_group;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb= _full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_boun= d + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (= tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_st= ores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Default;Mem;MemoryTLB;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;Default;LockCont;Mem;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slot= s;TopdownL1;tma_L1_group;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Default;Offcore;Scaled_Slots;TopdownL1;tm= a_L1_group", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. 
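
tma_bottleneck_other_bottlenecks is defined as the remainder once every
other tma_bottleneck_* value (all in scaled-slots percent) is subtracted
from 100, so the family as a whole is meant to account for the full
pipeline. A sketch with hypothetical per-bottleneck values:

  # The bottleneck metrics are percentages of scaled slots that should
  # sum to ~100; "other" is the remainder. Values are hypothetical.
  bottlenecks = {
      "big_code": 6.0, "instruction_fetch_bw": 9.0, "mispredictions": 14.0,
      "cache_memory_bandwidth": 11.0, "cache_memory_latency": 13.0,
      "memory_data_tlbs": 3.0, "memory_synchronization": 2.0,
      "compute_bound_est": 10.0, "irregular_overhead": 4.0,
      "branching_overhead": 3.0, "useful_work": 20.0,
  }
  other = 100 - sum(bottlenecks.values())
  print(f"tma_bottleneck_other_bottlenecks: {other:.1f}")
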
Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instr= uctions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_detect", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", + "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_branch_mispredicts", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. 
Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (8 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_branch_resteer", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_overlap", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c01_wait", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", + "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", + "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializi= ng_operation_group", + "MetricName": "tma_c02_wait", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU retired uops originated from CISC (complex instruction set computer) in= struction", + "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_cisc", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. 
A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_grou= p;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEN= D_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRE= D.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTE= ND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETI= RED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & 
tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.= WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP= _RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by backward-taken conditional branche= s", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_bwd_mispredicts", + "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by forward-taken conditional branches= ", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_M= ISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_fwd_mispredicts", + "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branc= h_mispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * 
cpu_core@= MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (2= 7 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) i= f 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RET= IRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * cpu_core@MEM_L= OAD_L3_HIT_RETIRED.XSNP_HITM@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * t= ma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 <= cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R else MEM_LOAD_L3_HIT_RETIRED.= XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_cor= e_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= ) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_contested_accesses", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", + "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_core_bound", + "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MIS= S / 2) / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_= L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (8 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_= group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIV_ACTIVE", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(cpu_core@IDQ.DSB_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x= 1@ + IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_= UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks", + "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma= _fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (8 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. 
For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers a suboptimal amount of uops to the Backend. Rel= ated metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_ipt= b, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_grou= p;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that are decoded into two or more uops", + "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_= group;tma_issueD0", + "MetricName": "tma_few_uops_instructions", + "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall arithmetic flo= ating-point (FP) operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector", + "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_fp_arith", + "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handling Floating Point (FP) Assists", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_fp_assists", + "MetricThreshold": "tma_fp_assists > 0.1", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handling Floating Point (FP) Assists. 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0= x30@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions, where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions, where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations, instructions that require = two or more uops or micro-coded sequences", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations, instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences. ([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (8 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_bandwidth", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (8 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_latency", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + 
"ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipflop", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.ALL@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX 256-bit in= struction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.256B_DP@ + cpu_atom@FP_INST_RETIRED.256B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx256", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts;Metric", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb= _switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", + "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL", + "MetricName": "tma_info_botlnk_l2_ic_misses", + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_a= tom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Ifetch", + "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", + "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_ato= m@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "Load_Store_Miss", + "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", + "MetricGroup": "Mem_Exec", + "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", + "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. 
See Info.Mem_Exec_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number mea= ns higher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricName": "tma_info_br_inst_mix_ipcall", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch (Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", + "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired return Branch Mispre= diction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Branch Misprediction= ", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipmispredict", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are backward taken c
onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_bwd", + "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are forward taken c= onditionals", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.AL= L_BRANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk_fwd", + "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL)= / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + t= ma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual 
per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREA= D\\,cmask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL@ / cpu_atom@INST_RETI= RED.ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MIT= E_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\= \=3D0x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= 
isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. 
See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_MISS@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;Inst_Metric;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Inst_Metric;Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_= issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_MISS@ / cpu= _atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_l1_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_load_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", + "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_store_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory disambiguation", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.DISAMBIGUATION@ / cpu= _atom@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_disamb_= pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory ordering", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MEMORY_ORDERING@ / cp= u_atom@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_monuke_= pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to memory renaming", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.MRN_NUKE@ / cpu_atom@= INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_mrn_pki= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to page faults", + "MetricExpr": "1e3 * 
cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_ato= m@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fa= ult_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@= MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@M= EM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", + "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_= HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", + "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD= _HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom= @LD_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_ipload", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", + "MetricName": "tma_info_mem_mix_ipstore", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_locks_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_splits_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of mem load uops to all uops", + "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@TOPDOWN_RETIRING.ALL@", + "MetricName": "tma_info_mem_mix_memload_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or 
retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", + "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_tim= e", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1dl0_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average 
per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Metric;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_c= ore@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD", + "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. 
Per-Logical Proce= ssor)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Metric;Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)", + "MetricGroup": "Core_Metric;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_dsb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / 
IDQ.MITE_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_mite",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per microcode Assist invocation",
+        "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
+        "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_pipeline_ipassist",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
+        "PublicDescription": "Instructions per microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_retire",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions",
+        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;MicroSeq;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_strings_cycles",
+        "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction",
+        "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricName": "tma_info_serialization_%_tpause_cycles",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles the processor is waiting yet unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 power-performance optimized states",
+        "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Metric",
+        "MetricName": "tma_info_system_c0_wait",
+        "MetricThreshold": "tma_info_system_c0_wait > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
+        "MetricGroup": "Power;Summary;System_Metric",
+        "MetricName": "tma_info_system_core_frequency",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average CPU Utilization (percentage)",
+        "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online",
+        "MetricGroup": "HPC;Metric;Summary",
+        "MetricName": "tma_info_system_cpu_utilization",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of utilized CPUs",
+        "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC",
+        "MetricGroup": "Metric;Summary",
+        "MetricName": "tma_info_system_cpus_utilized",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Giga Floating Point Operations Per Second",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time",
+        "MetricGroup": "Flops",
+        "MetricName": "tma_info_system_gflops",
+        "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;Inst_Metric;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "Metric;OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Clocks;Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC;System_Metric", + "MetricName": "tma_info_system_power", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Seconds;Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Count;Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Metric;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Metric;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggests high rate of \"execute\" at rename stage",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions Per Cycle (per Logical Processor)",
+        "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks",
+        "MetricGroup": "Metric;Ret;Summary",
+        "MetricName": "tma_info_thread_ipc",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total issue-pipeline slots (per-Physical Core till ICL; per-Logical Processor ICL onward)",
+        "MetricExpr": "slots",
+        "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_info_thread_slots",
+        "MetricgroupNoGroup": "TopdownL1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops Per Instruction",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED.ANY",
+        "MetricGroup": "Metric;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_thread_uoppi",
+        "MetricThreshold": "tma_info_thread_uoppi > 1.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Uops per taken branch",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETIRED.NEAR_TAKEN",
+        "MetricGroup": "Branches;Fed;FetchBW;Metric",
+        "MetricName": "tma_info_thread_uptb",
+        "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are FPDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are IDiv uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_idiv_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are microcode ops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_microcode_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of all uops which are x87 uops",
+        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN_RETIRING.ALL@",
+        "MetricName": "tma_info_uop_mix_x87_uop_ratio",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Integer Divider unit was active",
+        "MetricExpr": "tma_divider - tma_fp_divider",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_int_divider",
+        "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b",
+        "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_int_operations",
+        "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall Integer (Int) select operations fraction the CPU has executed (retired). Vector/Matrix Int operations and shuffles are counted. 
Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD= _RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_L= OAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thr= ead_clks", + "MetricGroup": "BvML;Clocks_Retired;MemoryLat;TopdownL4;tma_L4_gro= up;tma_l1_bound_group", + "MetricName": "tma_l1_latency_capacity", + "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound >= 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_l1_bound_group", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: DEPENDENT_LOADS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks", + "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;Topdown= L3;tma_L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tm= a_l2_bound_group", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tm= a_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_issueLat;tma_l3_bound_group;tma_overlap", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. 
Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation)",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
+        "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricName": "tma_light_operations",
+        "MetricThreshold": "tma_light_operations > 0.6",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations, instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized code running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. ([ICL+] Note this may undercount due to approximation using indirect events; [ADL+]). Sample with: INST_RETIRED.PREC_DIST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations",
+        "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_load_op_utilization",
+        "MetricThreshold": "tma_load_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.LOAD", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;= tma_dtlb_load_group", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group= ;tma_dtlb_load_group", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;= tma_load_stlb_miss_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to 
lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma= _issueRFO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "cpu_core@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / = tma_info_thread_clks", + "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_dram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). 
The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_gro= up;tma_dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_mem_scheduler", + "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group= ;tma_backend_bound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_op= erations_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4= _group;tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline)",
+        "MetricExpr": "(cpu_core@IDQ.MITE_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ / tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_mite",
+        "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the MITE pipeline (the legacy decode pipeline). This pipeline is used for code that was not pre-cached in the DSB or LSD. For example; inefficiencies due to asymmetric decoders; use of long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sample with: FRONTEND_RETIRED.ANY_DSB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates penalty in terms of percentage of ([SKL+] injected blend uops out of all Uops Issued, the Count Domain; [ADL+] cycles)",
+        "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utilized_0_group",
+        "MetricName": "tma_mixing_vectors",
+        "MetricThreshold": "tma_mixing_vectors > 0.05",
+        "PublicDescription": "This metric estimates penalty in terms of percentage of ([SKL+] injected blend uops out of all Uops Issued, the Count Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigating. Read more in Appendix B1 of the Optimizations Guide for this topic. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to the Microcode Sequencer (MS) unit - see Microcode_Sequencer node for details",
+        "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks",
+        "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_ms",
+        "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS)",
+        "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO",
+        "MetricName": "tma_ms_switches",
+        "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric estimates the fraction of cycles when the CPU was stalled due to switches of uop delivery to the Microcode Sequencer (MS). Commonly used instructions are optimized for delivery by the DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Certain operations cannot be handled natively by the execution pipeline; and must be performed by microcode (small programs injected into the execution stream). 
Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (8 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_non_mem_scheduler", + "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_oth= er_light_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_other_light_ops_group",
+        "MetricName": "tma_nop_instructions",
+        "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring NOP (no op) instructions. Compilers often use NOPs for certain address alignments - e.g. start address of a function or loop body. Sample with: INST_RETIRED.NOP",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_nuke",
+        "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05) & ((tma_bad_speculation >0.15)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_other_fb",
+        "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes",
+        "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * tma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128 + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tma_fused_instructions + tma_non_fused_branches))",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_other_light_ops",
+        "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents the remaining light uops fraction the CPU has executed - remaining means not covered by other sibling nodes. May undercount due to FMA double counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+        "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_ports_utilization",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_0",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_1",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_2",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
+        "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_3m",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.",
+        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_predecode",
+        "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth >0.10) & ((tma_frontend_bound >0.20)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_register",
+        "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls)",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_reorder_buffer",
+        "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles the core is stalled due to a resource limitation",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (8 * cpu_atom@CPU_CLK_UNHALTED.CORE@) - tma_core_bound",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_resource_bound",
+        "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bound >0.10))",
+        "MetricgroupNoGroup": "TopdownL2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
"tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (8 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma= _info_thread_clks", + "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_c= ore_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tm= a_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_othe= r_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", + "MetricName": "tma_slow_pause", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", + "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bo= und_group", + "MetricName": "tma_split_loads", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents rate of split store ac= cesses", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", + "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueS= pSt;tma_store_bound_group", + "MetricName": "tma_split_stores", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", + "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_cl= ks", + "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_grou= p;tma_issueBW;tma_l3_bound_group", + "MetricName": "tma_sq_full", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
+    {
+        "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
+        "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricName": "tma_sq_full",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO stores issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually have less impact on out-of-order core performance; however; holding resources for longer time can lead to undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * tma_info_thread_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming stores optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
+        "MetricExpr": "UOPS_DISPATCHED.ALU / (6 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_alu_op_utilization",
+        "MetricThreshold": "tma_alu_op_utilization > 0.4",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists",
+        "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots",
+        "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_assists",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handling SSE to AVX* or AVX* to SSE transition Assists",
+        "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_avx_assists",
+        "MetricThreshold": "tma_avx_assists > 0.1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_backend_bound",
+        "MetricThreshold": "tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "MetricExpr": "topdown\\-bad\\-spec / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_bad_speculation",
+        "MetricThreshold": "tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This includes slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_capacity / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_l1_latency_capacity + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_microcode_sequencer + tma_few_uops_instructions) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricName": "tma_branch_mispredicts",
+        "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
+        "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
+        "MetricName": "tma_branch_resteers",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
+        "MetricName": "tma_clears_resteers",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRED.L2_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instruction fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by non-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_nt_mispredicts",
+        "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by backward-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_BWD_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_tk_bwd_mispredicts",
+        "MetricThreshold": "tma_cond_tk_bwd_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by forward-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_FWD_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_tk_fwd_mispredicts",
+        "MetricThreshold": "tma_cond_tk_fwd_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_contested_accesses",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_data_sharing",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_divider",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
+        "MetricExpr": "MEMORY_STALLS.MEM / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_dram_bound",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
+        "MetricExpr": "(cpu_core@IDQ.DSB_UOPS\\,cmask\\=0x8\\,inv\\=0x1@ + IDQ.DSB_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS) * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tma_info_thread_clks",
+        "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_dsb",
+        "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
+        "MetricName": "tma_dsb_switches",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS.
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_= frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+        "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricName": "tma_fetch_latency",
+        "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
+        "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
+        "MetricName": "tma_few_uops_instructions",
+        "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
+        "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fp_arith",
+        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists",
+        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots",
+        "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_fp_assists",
+        "MetricThreshold": "tma_fp_assists > 0.1",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_ports_utili= zed_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", + "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR\\,umask\\=3D0= x30@ / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", + "MetricName": "tma_fp_vector_256b", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_frontend_bound", + "MetricThreshold": "tma_frontend_bound > 0.15", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", + "MetricName": "tma_fused_instructions", + "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 8 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative 
branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional backward-taken branches (lower number means higher occurrence rate)= ", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_BWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_bwd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for cond= itional forward-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN_FWD", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken_fwd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
+        "MetricGroup": "DSBmiss;Fed;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
+        "MetricName": "tma_info_botlnk_l2_ic_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are CALL or RET",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_RETURN) / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_callret",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are non-taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_nt",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are backward taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_BWD / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_tk_bwd",
+        "MetricThreshold": "tma_info_branches_cond_tk_bwd > 0.3",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are forward taken conditionals",
+        "MetricExpr": "BR_INST_RETIRED.COND_TAKEN_FWD / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches;CodeGen;PGO",
+        "MetricName": "tma_info_branches_cond_tk_fwd",
+        "MetricThreshold": "tma_info_branches_cond_tk_fwd > 0.2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches that are unconditional (direct or indirect) jumps",
+        "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_TAKEN_BWD - BR_INST_RETIRED.COND_TAKEN_FWD - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_jump",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Fraction of branches of other types (not individually covered by other metrics in Info.Branches group)",
+        "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_cond_tk_bwd + tma_info_branches_cond_tk_fwd + tma_info_branches_callret + tma_info_branches_jump)",
+        "MetricGroup": "Bad;Branches",
+        "MetricName": "tma_info_branches_other_branches",
+        "Unit": 
"cpu_core" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_thread_clks", + "MetricGroup": "Flops;Ret", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.V0 + FP_ARITH_DISPATCHED.V1 + = FP_ARITH_DISPATCHED.V2 + FP_ARITH_DISPATCHED.V3) / (4 * tma_info_thread_clk= s)", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREA= D\\,cmask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 8 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MIT= E_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\= \=3D0x1@", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + 
"MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Summary;TmaL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). 
Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + 
"MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_SWPF", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 8 * 2 + 1", + "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= Level 0 within L1D cache [GB / sec]", + "MetricExpr": "64 * L1D.L0_REPLACEMENT / 1e9 / tma_info_system_tim= e", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l1dl0_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / 
INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_core" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_c= ore@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PENDING.LOAD / L1D_MISS.LOAD", + "MetricGroup": "Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem", + "MetricName": 
"tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PENDING.LOAD / L1D_PENDING.LOAD_CYCLES", + "MetricGroup": "Mem;MemoryBW;MemoryBound", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_thread_clks)", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_core" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_mpki", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from DSB per c= ycle", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_dsb", 
+ "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from LSD per c= ycle", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_lsd", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of uops fetched from MITE per = cycle", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", + "MetricGroup": "Fed;FetchBW", + "MetricName": "tma_info_pipeline_fetch_mite", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per a microcode Assist invocatio= n", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", + "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", + "MetricName": "tma_info_pipeline_ipassist", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", + "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "Pipeline;Ret", + "MetricName": "tma_info_pipeline_retire", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.= SLOTS\\,cmask\\=3D0x1@", + "MetricGroup": "MicroSeq;Pipeline;Ret", + "MetricName": "tma_info_pipeline_strings_cycles", + "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", + "MetricGroup": "C0Wait", + "MetricName": "tma_info_system_c0_wait", + "MetricThreshold": "tma_info_system_c0_wait > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", + "MetricGroup": "Power;Summary", + "MetricName": "tma_info_system_core_frequency", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average CPU Utilization (percentage)", + "MetricExpr": "tma_info_system_cpus_utilized / #num_cpus_online", + "MetricGroup": "HPC;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Cor;Flops;HPC", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "OS", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_core" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", + "MetricGroup": "Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "slots", + "MetricGroup": "TmaL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Uops per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 8 * 1.5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", + "MetricExpr": "INT_VEC_RETIRED.128BIT / (tma_retiring * tma_info_t= hread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_128b", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction= the CPU has retired", + "MetricExpr": "INT_VEC_RETIRED.256BIT / (tma_retiring * tma_info_t= hread_slots)", + "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", + "MetricName": "tma_int_vector_256b", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_ports_utilized_= 2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", + "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", + "MetricName": "tma_itlb_misses", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", + "MetricExpr": "MEMORY_STALLS.L1 / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", + "MetricName": "tma_l1_bound", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit Level 1 after missing Level 0 within the L1D= cache", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L1_HIT_L1 * cpu_core@MEM_LOAD= _RETIRED.L1_HIT_L1@R, MEM_LOAD_RETIRED.L1_HIT_L1 * 9) if 0 < cpu_core@MEM_L= OAD_RETIRED.L1_HIT_L1@R else MEM_LOAD_RETIRED.L1_HIT_L1 * 9) / tma_info_thr= ead_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", + "MetricName": "tma_l1_latency_capacity", + "MetricThreshold": "tma_l1_latency_capacity > 0.1 & tma_l1_bound >= 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "4 * DEPENDENT_LOADS.ANY / tma_info_thread_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: DEPENDENT_LOADS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", + "MetricExpr": "MEMORY_STALLS.L2 / tma_info_thread_clks", + "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", + "MetricName": "tma_l2_bound", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "MEMORY_STALLS.L3 / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", + "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. 
Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.LOAD / (3 * tma_info_thread_clks)", + "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
Sample with: = UOPS_DISPATCHED.LOAD", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) DTLB was missed by load accesses, that late= r on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_hit", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", + "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", + "MetricName": "tma_load_stlb_miss", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= 
IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", + "MetricName": "tma_lock_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", + "MetricExpr": "cpu_core@LSD.UOPS\\,cmask\\=3D0x8\\,inv\\=3D0x1@ / = tma_info_thread_clks", + "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", + "MetricName": "tma_lsd", + "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", + "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricName": "tma_machine_clears", + "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricName": "tma_mem_bandwidth", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). 
The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", + "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", + "MetricName": "tma_mem_latency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", + "MetricName": "tma_memory_bound", + "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", + "MetricName": "tma_memory_fence", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_memory_operations", + "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", + "MetricName": "tma_microcode_sequencer", + "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", + "MetricName": "tma_mispredicts_resteers", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(cpu_core@IDQ.MITE_UOPS\\,cmask\\=3D0x8\\,inv\\=3D0= x1@ / tma_info_thread_clks + IDQ.MITE_UOPS / (IDQ.DSB_UOPS + IDQ.MITE_UOPS)= * (IDQ_BUBBLES.CYCLES_0_UOPS_DELIV.CORE - IDQ_BUBBLES.FETCH_LATENCY)) / tm= a_info_thread_clks", + "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "IDQ.MS_CYCLES_ANY / tma_info_thread_clks", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", + "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.BR_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_x87_use + (FP_AR= ITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED.VECTOR) / (tma_retiring * t= ma_info_thread_slots) + (INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= + INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI= _256) / (tma_retiring * tma_info_thread_slots) + tma_memory_operations + tm= a_fused_instructions + tma_non_fused_branches))", + "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
+    {
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Page Faults. A Page Fault may apply on first application access to a memory page. Note operating system handling of page faults accounts for the majority of its cost",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related)",
+        "MetricExpr": "((EXE_ACTIVITY.EXE_BOUND_0_PORTS + (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)",
+        "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group",
+        "MetricName": "tma_ports_utilization",
+        "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU performance was potentially limited due to Core computation issues (non divider-related). Two distinct categories can be attributed into this metric: (1) heavy data-dependency among contiguous instructions would manifest in this metric - such cases are often referred to as low Instruction Level Parallelism (ILP). (2) Contention on some hardware execution unit other than Divider. For example; when there are too many multiply operations",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_0",
+        "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed no uops on any execution port (Logical Processor cycles since ICL, Physical Core cycles otherwise). Long-latency instructions like divides may contribute to this metric",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_1",
+        "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the CPU executed total of 1 uop per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). This can be due to heavy data-dependency among software instructions; or over oversubscribing a particular hardware resource. In some other cases with high 1_Port_Utilized and L1_Bound; this metric can point to L1 data-cache latency bottleneck that may not necessarily manifest with complete execution starvation (due to the short L1 latency e.g. walking a linked list) - looking at the assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related metrics: tma_l1_bound",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks",
+        "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_2",
+        "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 2 uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization -most compilers feature auto-Vectorization options today- reduces pressure on the execution ports as multiple elements are calculated with same uop. Sample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise)",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks",
+        "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_utilization_group",
+        "MetricName": "tma_ports_utilized_3m",
+        "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utilization > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles CPU executed total of 3 or more uops per cycle on all execution ports (Logical Processor cycles since ICL, Physical Core cycles otherwise). Sample with: UOPS_EXECUTED.CYCLES_GE_3",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by (indirect) RET instructions",
+        "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED.RET_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_ret_mispredicts",
+        "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group",
+        "MetricName": "tma_retiring",
+        "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category. Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved. Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance. For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations",
+        "MetricExpr": "(BE_STALLS.SCOREBOARD + CPU_CLK_UNHALTED.C02) / tma_info_thread_clks",
+        "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_group;tma_issueSO",
+        "MetricName": "tma_serializing_operation",
+        "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU issue-pipeline was stalled due to serializing operations. Instructions like CPUID; WRMSR or LFENCE serialize the out-of-order execution which may limit performance. Sample with: BE_STALLS.SCOREBOARD. Related metrics: tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer)",
+        "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_light_ops_group",
+        "MetricName": "tma_shuffles_256b",
+        "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops > 0.3 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring Shuffle operations of 256-bit vector size (FP or Integer). Shuffles may incur slow cross \"vector lane\" data transfers",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_slow_pause",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary",
+        "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_split_loads",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents rate of split store accesses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
+        "MetricName": "tma_split_stores",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
+        "MetricExpr": "(XQ.FULL + L1D_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricName": "tma_sq_full",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.STD + UOPS_DISPATCHED.STA) / (7 * tma_info_thread_clks)",
+        "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.STD, UOPS_DISPATCHED.STA",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    }
+]
diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/memory.json b/tools/perf/pmu-events/arch/x86/lunarlake/memory.json
index 3d12e226d5ef..60daff922a89 100644
--- a/tools/perf/pmu-events/arch/x86/lunarlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/lunarlake/memory.json
@@ -1,13 +1,168 @@
 [
     {
-        "BriefDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles.",
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to any number of reasons, including an L1 miss, WCB full, pagewalk, store address block or store data block.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.ANY",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x7f",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to any number of reasons, including an L1 miss, WCB full, pagewalk, store address block or store data block, on a load that retires.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.ANY_AT_RET",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xff",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to a core bound stall including a store address match, a DTLB miss or a page walk that detains the load from retiring.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.L1_BOUND_AT_RET",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xf4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to a DL1 miss.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.L1_MISS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to a DL1 miss.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.L1_MISS_AT_RET",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x81",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to other block cases.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.OTHER",
+        "PublicDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to other block cases such as pipeline conflicts, fences, etc.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x40",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to other block cases.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x05",
+        "EventName": "LD_HEAD.OTHER_AT_RET",
+        "PublicDescription": "Counts the number of cycles that the head (oldest load) of the load buffer and retirement are both stalled due to other block cases such as pipeline conflicts, fences, etc.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0xc0",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of cycles that the head (oldest load) of the load buffer is stalled due to a
pagewalk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a page= walk.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.PGWALK_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0xa0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a store address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a stor= e address match.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_ADDR_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x84", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to store data forward block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.ST_DATA", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to request buffers full or loc= k in progress.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.WCB_FULL", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to reques= t buffers full or lock in progress.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.WCB_FULL_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to a snoop from an external agent. Does not count inte= rnally generated machine clears such as those due to disambiguations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "SampleAfterValue": "20003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears due to memory orderi= ng conflicts.", "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", + "PublicDescription": "Counts the number of Machine Clears detected= dye to memory ordering. 
Memory Ordering Machine Clears may apply when a me= mory read may not conform to the memory ordering rules of the x86 architect= ure", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MEMORY_ORDERING_FAST", + "SampleAfterValue": "20003", + "UMask": "0x82", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 1024 cycles.", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 1024 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1", @@ -15,13 +170,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 128 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1", @@ -29,13 +183,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 16 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1", @@ -43,13 +196,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 2048 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048", "MSRIndex": "0x3F6", "MSRValue": "0x800", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 2048 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "23", "UMask": "0x1", @@ -57,13 +209,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 256 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. 
Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1", @@ -71,13 +222,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 32 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1", @@ -85,13 +235,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 4 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1", @@ -99,13 +248,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 512 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1", @@ -113,13 +261,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 64 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1", @@ -127,13 +274,12 @@ }, { "BriefDescription": "Counts randomly selected loads when the laten= cy from first dispatch to completion is greater than 8 cycles.", - "Counter": "0,1,2,3,4,5,6,7,8,9", + "Counter": "2,3,4,5,6,7,8,9", "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1", @@ -145,54 +291,112 @@ "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", - "PEBS": "2", "PublicDescription": "Counts Retired memory accesses with at least= 1 store operation. This PEBS event is the precisely-distributed (PDist) tr= igger covering all stores uops for sampling by the PEBS Store Latency Facil= ity. 
The facility is described in Intel SDM Volume 3 section 19.9.8", "SampleAfterValue": "1000003", "UMask": "0x2", "Unit": "cpu_core" }, { - "BriefDescription": "Counts cacheable demand data reads were not s= upplied by the L3 cache.", + "BriefDescription": "Counts misaligned loads that are 4K page spli= ts.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x13", + "EventName": "MISALIGN_MEM_REF.LOAD_PAGE_SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts misaligned stores that are 4K page spl= its.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x13", + "EventName": "MISALIGN_MEM_REF.STORE_PAGE_SPLIT", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache and were su= pplied by the system memory (DRAM, MSC, or MMIO).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x13FBFC00004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache and were supplied by the system memory (DRAM, MSC, or MM= IO).", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xB7", "EventName": "OCR.DEMAND_DATA_RD.L3_MISS", "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3FBFC00001", + "MSRValue": "0x13FBFC00001", "SampleAfterValue": "100003", "UMask": "0x1", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", + "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache and were supplied by the system memory (DRAM, MSC, or MM= IO).", "Counter": "0,1,2,3", "EventCode": "0x2A,0x2B", "EventName": "OCR.DEMAND_DATA_RD.L3_MISS", "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0xFE7F8000001", + "MSRValue": "0x9E7FA000001", "SampleAfterValue": "100003", "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Counts demand reads for ownership, including = SWPREFETCHW which is an RFO were not supplied by the L3 cache.", + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were no= t supplied by the L3 cache and were supplied by the system memory (DRAM, MS= C, or MMIO).", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xB7", "EventName": "OCR.DEMAND_RFO.L3_MISS", "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0x3FBFC00002", + "MSRValue": "0x13FBFC00002", "SampleAfterValue": "100003", "UMask": "0x1", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were no= t supplied by the L3 cache.", + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were no= t supplied by the L3 cache and were supplied by the system memory (DRAM, MS= C, or MMIO).", "Counter": "0,1,2,3", "EventCode": "0x2A,0x2B", "EventName": "OCR.DEMAND_RFO.L3_MISS", "MSRIndex": "0x1a6,0x1a7", - "MSRValue": "0xFE7F8000002", + "MSRValue": "0x9E7FA000002", "SampleAfterValue": "100003", "UMask": "0x1", "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand data read requests that miss th= e L3 cache.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x21", + "EventName": 
"OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where data return is pending for a Dem= and Data Read request who miss L3 cache.", + "Counter": "0,1,2,3", + "CounterMask": "1", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_L3_MISS_DEM= AND_DATA_RD", + "PublicDescription": "Cycles with at least 1 Demand Data Read requ= ests who miss L3 cache in the superQ.", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "For every cycle, increments by the number of = demand data read requests pending that are known to have missed the L3 cach= e.", + "Counter": "0,1,2,3", + "EventCode": "0x20", + "EventName": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD", + "PublicDescription": "For every cycle, increments by the number of= demand data read requests pending that are known to have missed the L3 cac= he. Note that this does not capture all elapsed cycles while requests are = outstanding - only cycles from when the requests were known by the requesti= ng core to have missed the L3 cache.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" } ] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/metricgroups.json b/t= ools/perf/pmu-events/arch/x86/lunarlake/metricgroups.json new file mode 100644 index 000000000000..855585fe6fae --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/lunarlake/metricgroups.json @@ -0,0 +1,150 @@ +{ + "Backend": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Bad": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "BadSpec": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "BigFootprint": "Grouping from Top-down Microarchitecture Analysis Met= rics spreadsheet", + "BrMispredicts": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Branches": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "BvBC": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvBO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMT": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvOB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvUW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "C0Wait": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "CacheHits": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "CacheMisses": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "CodeGen": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "Compute": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", 
+ "Cor": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSB": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "DSBmiss": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "DataSharing": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "Fed": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "FetchBW": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "FetchLat": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Flops": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "FpScalar": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "FpVector": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "Frontend": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "HPC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "IcMiss": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Ifetch": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "IntVector": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis = Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", + "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", + "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "MemOffcore": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Mem_Exec": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBW": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "MemoryBound": "Grouping from Top-down Microarchitecture Analysis Metr= ics spreadsheet", + "MemoryLat": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "MemoryTLB": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_BW": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Memory_Lat": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "MicroSeq": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "OS": "Grouping from Top-down Microarchitecture Analysis Metrics sprea= dsheet", + "Offcore": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "PGO": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", + "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", + "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Retire": "Grouping from 
Top-down Microarchitecture Analysis Metrics s= preadsheet", + "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Server": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", + "Snoop": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "SoC": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "Summary": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", + "TmaL1": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL2": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "TmaL3mem": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", + "TopdownL1": "Metrics for top-down breakdown at level 1", + "TopdownL2": "Metrics for top-down breakdown at level 2", + "TopdownL3": "Metrics for top-down breakdown at level 3", + "TopdownL4": "Metrics for top-down breakdown at level 4", + "TopdownL5": "Metrics for top-down breakdown at level 5", + "TopdownL6": "Metrics for top-down breakdown at level 6", + "load_store_bound": "Grouping from Top-down Microarchitecture Analysis= Metrics spreadsheet", + "tma_L1_group": "Metrics for top-down breakdown at level 1", + "tma_L2_group": "Metrics for top-down breakdown at level 2", + "tma_L3_group": "Metrics for top-down breakdown at level 3", + "tma_L4_group": "Metrics for top-down breakdown at level 4", + "tma_L5_group": "Metrics for top-down breakdown at level 5", + "tma_L6_group": "Metrics for top-down breakdown at level 6", + "tma_alu_op_utilization_group": "Metrics contributing to tma_alu_op_ut= ilization category", + "tma_assists_group": "Metrics contributing to tma_assists category", + "tma_backend_bound_group": "Metrics contributing to tma_backend_bound = category", + "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", + "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", + "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", + "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", + "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", + "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", + "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", + "tma_fetch_bandwidth_group": "Metrics contributing to tma_fetch_bandwi= dth category", + "tma_fetch_latency_group": "Metrics contributing to tma_fetch_latency = category", + "tma_fp_arith_group": "Metrics contributing to tma_fp_arith category", + "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", + "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", + "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", + "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_band= width category", + "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latenc= y category", + "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", + "tma_issue2P": "Metrics related by the issue $issue2P", + "tma_issueBM": "Metrics related by the issue $issueBM", + 
"tma_issueBW": "Metrics related by the issue $issueBW", + "tma_issueComp": "Metrics related by the issue $issueComp", + "tma_issueD0": "Metrics related by the issue $issueD0", + "tma_issueFB": "Metrics related by the issue $issueFB", + "tma_issueFL": "Metrics related by the issue $issueFL", + "tma_issueL1": "Metrics related by the issue $issueL1", + "tma_issueLat": "Metrics related by the issue $issueLat", + "tma_issueMC": "Metrics related by the issue $issueMC", + "tma_issueMS": "Metrics related by the issue $issueMS", + "tma_issueMV": "Metrics related by the issue $issueMV", + "tma_issueRFO": "Metrics related by the issue $issueRFO", + "tma_issueSL": "Metrics related by the issue $issueSL", + "tma_issueSO": "Metrics related by the issue $issueSO", + "tma_issueSmSt": "Metrics related by the issue $issueSmSt", + "tma_issueSpSt": "Metrics related by the issue $issueSpSt", + "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", + "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", + "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", + "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", + "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", + "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", + "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", + "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", + "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", + "tma_microcode_sequencer_group": "Metrics contributing to tma_microcod= e_sequencer category", + "tma_mite_group": "Metrics contributing to tma_mite category", + "tma_other_light_ops_group": "Metrics contributing to tma_other_light_= ops category", + "tma_ports_utilization_group": "Metrics contributing to tma_ports_util= ization category", + "tma_ports_utilized_0_group": "Metrics contributing to tma_ports_utili= zed_0 category", + "tma_ports_utilized_3m_group": "Metrics contributing to tma_ports_util= ized_3m category", + "tma_resource_bound_group": "Metrics contributing to tma_resource_boun= d category", + "tma_retiring_group": "Metrics contributing to tma_retiring category", + "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", + "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" +} diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/other.json b/tools/pe= rf/pmu-events/arch/x86/lunarlake/other.json index 0b49b4684c4b..667707d4fe37 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/other.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/other.json @@ -1,6 +1,345 @@ [ { - "BriefDescription": "Counts cacheable demand data reads Catch all = value for any response types - this includes response types not define in t= he OCR. 
If this is set all other response types will be ignored",
+        "BriefDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists, which are counted by dedicated sub-events.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.HARDWARE",
+        "PublicDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists, which are counted by dedicated sub-events. This includes, but is not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist), assists generated by ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "ASSISTS.PAGE_FAULT",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.PAGE_FAULT",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa2",
+        "EventName": "BE_STALLS.SCOREBOARD",
+        "SampleAfterValue": "100003",
+        "UMask": "0x2",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted cycles a Core is blocked due to a lock In Progress issued by another core",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.BLOCKED_CYCLES",
+        "PublicDescription": "Counts the number of unhalted cycles a Core is blocked due to a lock In Progress issued by another core. Counts on a per core basis.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of unhalted cycles a Core is blocked due to an Accepted lock it issued, includes both split and non-split lock cycles.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.LOCK_CYCLES",
+        "PublicDescription": "Counts the number of unhalted cycles a Core is blocked due to an Accepted lock it issued, includes both split and non-split lock cycles. Counts on a per core basis.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of non-split locks such as UC locks issued by a Core (does not include cache locks)",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.NON_SPLIT_LOCKS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of split locks issued by a Core",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x63",
+        "EventName": "BUS_LOCK.SPLIT_LOCKS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of times a load depends on another load that had just written back its data, either in the previous cycle or 2 cycles back.
This event supports in-direct dependency through a single uop.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x02", + "EventName": "DEPENDENT_LOADS.ANY", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles the L2 Prefetcher= s are at throttle level 0", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x32", + "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL0_SOC", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the L2 Prefetcher= throttle level is at 1", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x32", + "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL1_SOC", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the L2 Prefetcher= throttle level is at 2", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x32", + "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL2_SOC", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the L2 Prefetcher= throttle level is at 3", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x32", + "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL3_SOC", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the L2 Prefetcher= throttle level is at 4", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x32", + "EventName": "DYNAMIC_PREFETCH_THROTTLER.LEVEL4_SOC", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on all Int= eger ports.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.ALL", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a load = port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.LD", + "PublicDescription": "Counts the number of uops executed on a load= port. 
This event counts for integer uops even if the destination is FP/ve= ctor", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 0.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P0", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 1.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P1", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 2.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P2", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.P3", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on integer= port 0,1, 2, 3.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.PRIMARY", + "SampleAfterValue": "1000003", + "UMask": "0x78", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on a Store= address port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STA", + "PublicDescription": "Counts the number of uops executed on a Stor= e address port. This event counts integer uops even if the data source is F= P/vector", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of uops executed on an inte= ger store data and jump port.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xb3", + "EventName": "INT_UOPS_EXECUTED.STD_JMP", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches that were= throttled due to Dynamic Prefetch Throttling. The throttle requestor/sou= rce could be from the uncore/SOC or the Dead Block Predictor. Counts on a p= er core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.DPT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches throttled= due to Demand Throttle Prefetcher. DTP Global Triggered with no Local Ove= rride. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.DTP", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches not throt= tled by DTP due to local override. These prefetches may still be throttled= due to another throttler mechanism. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.DTP_OVERRIDE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches throttled= due to LLC hit rate in . 
Counts on a per core basis= .", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.HIT_RATE", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LLC prefetches throttled= due to exceeding the XQ threshold set by either XQ_THRESOLD_DTP or LLC_XQ_= THRESHOLD. Counts on a per core basis.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x29", + "EventName": "LLC_PREFETCHES_THROTTLED.XQ_THRESH", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L1 cache (that is: no execution & load in flight = & no load missed L1 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L1", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L2 cache (that is: no execution & load in flight = & load missed L1 & no load missed L2 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L2", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for L3 cache (that is: no execution & load in flight = & load missed L1 & load missed L2 cache & no load missed L3 Cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.L3", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles where no execution is happening= due to loads waiting for Memory (that is: no execution & load in flight & = a load missed L3 cache)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x46", + "EventName": "MEMORY_STALLS.MEM", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts all requests that have any type of res= ponse.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.ALL_REQUESTS.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0xFF0000001DFFF", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts writebacks of modified cachelines that= have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.COREWB_M.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10008", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts writebacks of non-modified cachelines = that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.COREWB_NONM.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x11000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by DRAM.", + "Counter": 
"0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1FBC000004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand data reads that have any type o= f response.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xB7", "EventName": "OCR.DEMAND_DATA_RD.ANY_RESPONSE", @@ -22,7 +361,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Counts cacheable demand data reads were suppl= ied by DRAM.", + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xB7", "EventName": "OCR.DEMAND_DATA_RD.DRAM", @@ -44,7 +383,7 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Counts demand reads for ownership, including = SWPREFETCHW which is an RFO Catch all value for any response types - this i= ncludes response types not define in the OCR. If this is set all other res= ponse types will be ignored", + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that have an= y type of response.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xB7", "EventName": "OCR.DEMAND_RFO.ANY_RESPONSE", @@ -64,5 +403,156 @@ "SampleAfterValue": "100003", "UMask": "0x1", "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts demand read for ownership (RFO) reques= ts and software prefetches for exclusive ownership (PREFETCHW) that were su= pplied by DRAM.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_RFO.DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x1FBC000002", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts full streaming stores (64 bytes, WCiLF= ) that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x800000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts partial streaming stores (less than 64= bytes, WCiL) that have any type of response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x400000010000", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores that have any type of= response.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10800", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts streaming stores that have any type of= response.", + "Counter": "0,1,2,3", + "EventCode": "0x2A,0x2B", + "EventName": "OCR.STREAMING_WR.ANY_RESPONSE", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10800", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when Reservation Station (RS) is empty= for the thread.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa5", + "EventName": "RS.EMPTY", + "PublicDescription": "Counts cycles during which the reservation s= tation (RS) is empty for this logical processor. This is usually caused whe= n the front-end pipeline runs into starvation periods (e.g. 
branch mispredictions or i-cache misses)",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts end of periods where the Reservation Station (RS) was empty.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EdgeDetect": "1",
+        "EventCode": "0xa5",
+        "EventName": "RS.EMPTY_COUNT",
+        "Invert": "1",
+        "PublicDescription": "Counts end of periods where the Reservation Station (RS) was empty. Could be useful to closely sample on front-end latency issues (see the FRONTEND_RETIRED event of designated precise events)",
+        "SampleAfterValue": "100003",
+        "UMask": "0x7",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Cycles when RS was empty and a resource allocation stall is asserted",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xa5",
+        "EventName": "RS.EMPTY_RESOURCE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots in a UMWAIT or TPAUSE instruction where no uop issues due to the instruction putting the CPU into the C0.1 activity state.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x75",
+        "EventName": "SERIALIZATION.C01_MS_SCB",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of issue slots not consumed due to a color request for an FCW or MXCSR control register when all 4 colors (copies) are already in use",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x75",
+        "EventName": "SERIALIZATION.COLOR_STALLS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x8",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Cycles the uncore cannot take further requests",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "CounterMask": "1",
+        "EventCode": "0x2d",
+        "EventName": "XQ.FULL",
+        "PublicDescription": "Number of cycles when the thread is active and the uncore cannot take any further requests (for example prefetches, loads or stores initiated by the Core that miss the L2 cache).",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand request.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.ALL",
+        "SampleAfterValue": "1000003",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand code read.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.CRDS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand read.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.DRDS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of prefetch requests that were promoted in the XQ to a demand RFO.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xf4",
+        "EventName": "XQ_PROMOTION.RFOS",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json b/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json
index 220c2115fec9..59e2ae652b2d 100644
--- a/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/lunarlake/pipeline.json
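Many of the events in other.json above are defined once per hybrid PMU: the same EventName appears under both "cpu_atom" and "cpu_core" with different encodings (OCR.STREAMING_WR.ANY_RESPONSE is one example). Since these files are plain JSON arrays, that layout lends itself to a mechanical check. A minimal Python sketch, assuming a checked-out perf tree; the path below is illustrative, not something this patch adds:

    import json
    from collections import defaultdict
    from pathlib import Path

    # Illustrative path into a perf source tree checkout.
    events_file = Path("tools/perf/pmu-events/arch/x86/lunarlake/other.json")

    # Group event definitions by name; hybrid events appear once per PMU.
    by_name = defaultdict(list)
    for event in json.loads(events_file.read_text()):
        by_name[event["EventName"]].append(event)

    for name, defs in sorted(by_name.items()):
        if len(defs) > 1:
            units = sorted(d.get("Unit", "?") for d in defs)
            print(f"{name}: defined for {', '.join(units)}")

The real consumer of these files is the jevents generator under tools/perf/pmu-events/, which compiles them into perf's built-in event tables; the sketch only illustrates the per-Unit structure visible in this patch.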
@@ -1,325 +1,2002 @@ [ + { + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles when divide unit is busy executing div= ide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.DIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. Accounts for integer and floating-po= int operations.", + "SampleAfterValue": "1000003", + "UMask": "0x9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of active floating point an= d integer dividers per cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_OCCUPANCY", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of floating point and integ= er divider uops executed per cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xcd", + "EventName": "ARITH.DIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0xc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles when any of the i= nteger dividers are active.", + "Counter": "0,1,2,3,4,5,6,7", + "CounterMask": "1", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_ACTIVE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles when integer divide unit is busy execu= ting divide or square root operations.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xb0", + "EventName": "ARITH.IDIV_ACTIVE", + "PublicDescription": "Counts cycles when divide unit is busy execu= ting divide or square root operations. Accounts for integer operations only= .", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of active integer dividers = per cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_OCCUPANCY", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of integer divider uops exe= cuted per cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xcd", + "EventName": "ARITH.IDIV_UOPS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of occurrences where a microcode assis= t is invoked by hardware.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc1", + "EventName": "ASSISTS.ANY", + "PublicDescription": "Counts the number of occurrences where a mic= rocode assist is invoked by hardware. Examples include AD (page Access Dirt= y), FP and AVX related assists.", + "SampleAfterValue": "100003", + "UMask": "0x1f", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the total number of branch instruction= s retired for all branch types.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts the total number of instructions in w= hich the instruction pointer (IP) of the processor is resteered due to a br= anch instruction and the branch instruction successfully retires. 
All branch type instructions are accounted for.",
         "SampleAfterValue": "200003",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "All branch instructions retired.",
-        "Counter": "0,1,2,3,4,5,6,7,8,9",
-        "EventCode": "0xc4",
-        "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
-        "PublicDescription": "Counts all branch instructions retired.",
-        "SampleAfterValue": "400009",
-        "Unit": "cpu_core"
+        "BriefDescription": "All branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
+        "PublicDescription": "Counts all branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts retired JCC (Jump on Conditional Code) branch instructions; includes both taken and not taken branches",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7e",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND",
+        "PublicDescription": "Counts conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x111",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of not taken JCC branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_NTAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0x7f",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Not taken branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_NTAKEN",
+        "PublicDescription": "Counts not taken branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of taken JCC branch instructions retired",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfe",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Taken conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN",
+        "PublicDescription": "Counts taken conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x101",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Taken backward conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN_BWD",
+        "PublicDescription": "Counts taken backward conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x1",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Taken forward conditional branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.COND_TAKEN_FWD",
+        "PublicDescription": "Counts taken forward conditional branch instructions retired.",
+        "SampleAfterValue": "400009",
+        "UMask": "0x102",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of far branch instructions retired, includes far jump, far call and return, and Interrupt call and return",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.FAR_BRANCH",
+        "SampleAfterValue": "200003",
+
"UMask": "0xbf", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Far branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.FAR_BRANCH", + "PublicDescription": "Counts far branch instructions retired.", + "SampleAfterValue": "100007", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near indirect JMP and ne= ar indirect CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Indirect near branch instructions retired (ex= cluding returns)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT", + "PublicDescription": "Counts near indirect branch instructions ret= ired excluding returns. TSX abort is an indirect branch.", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near indirect CALL branc= h instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of near indirect JMP branch= instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of near CALL branch instruc= tions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "SampleAfterValue": "200003", + "UMask": "0xf9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Direct and indirect near call instructions re= tired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_CALL", + "PublicDescription": "Counts both direct and indirect near call in= structions retired.", + "SampleAfterValue": "100007", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near RET branch instruct= ions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Return instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_RETURN", + "PublicDescription": "Counts return instructions retired.", + "SampleAfterValue": "100007", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of taken branch instruction= s retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xc0", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken branch instructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc4", + "EventName": "BR_INST_RETIRED.NEAR_TAKEN", + "PublicDescription": "Counts taken branch instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of near relative CALL branc= h instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc4", + "EventName": 
"BR_INST_RETIRED.REL_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfd", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", + "SampleAfterValue": "200003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "All mispredicted branch instructions retired.= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", + "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", + "SampleAfterValue": "400009", + "Unit": "cpu_core" + }, + { + "BriefDescription": "All mispredicted branch instructions retired.= This precise event may be used to get the misprediction cost via the Retir= e_Latency field of PEBS. It fires on the instruction that immediately follo= ws the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.ALL_BRANCHES_COST", + "SampleAfterValue": "400009", + "UMask": "0x44", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted JCC branch = instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "SampleAfterValue": "200003", + "UMask": "0x7e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted conditional branch instructions = retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND", + "PublicDescription": "Counts mispredicted conditional branch instr= uctions retired.", + "SampleAfterValue": "400009", + "UMask": "0x111", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted conditional branch instructions = retired. This precise event may be used to get the misprediction cost via t= he Retire_Latency field of PEBS. 
It fires on the instruction that immediate= ly follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_COST", + "SampleAfterValue": "400009", + "UMask": "0x151", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted not taken J= CC branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_NTAKEN", + "SampleAfterValue": "200003", + "UMask": "0x7f", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted non-taken conditional branch ins= tructions retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_NTAKEN", + "PublicDescription": "Counts the number of conditional branch inst= ructions retired that were mispredicted and the branch direction was not ta= ken.", + "SampleAfterValue": "400009", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted non-taken conditional branch ins= tructions retired. This precise event may be used to get the misprediction = cost via the Retire_Latency field of PEBS. It fires on the instruction that= immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x50", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted taken JCC b= ranch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0xfe", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN", + "PublicDescription": "Counts taken conditional mispredicted branch= instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x101", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken backward.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD", + "PublicDescription": "Counts taken backward conditional mispredict= ed branch instructions retired.", + "SampleAfterValue": "400009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken backward. This precise event may be used to get t= he misprediction cost via the Retire_Latency field of PEBS. It fires on the= instruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_BWD_COST", + "SampleAfterValue": "400009", + "UMask": "0x8001", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted taken conditional branch instruc= tions retired. This precise event may be used to get the misprediction cost= via the Retire_Latency field of PEBS. 
It fires on the instruction that imm= ediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x141", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken forward.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD", + "PublicDescription": "Counts taken forward conditional mispredicte= d branch instructions retired.", + "SampleAfterValue": "400009", + "Unit": "cpu_core" + }, + { + "BriefDescription": "number of branch instructions retired that we= re mispredicted and taken forward. This precise event may be used to get th= e misprediction cost via the Retire_Latency field of PEBS. It fires on the = instruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.COND_TAKEN_FWD_COST", + "SampleAfterValue": "400009", + "UMask": "0x8002", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP and near indirect CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "SampleAfterValue": "200003", + "UMask": "0xeb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Miss-predicted near indirect branch instructi= ons retired (excluding returns)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT", + "PublicDescription": "Counts miss-predicted near indirect branch i= nstructions retired excluding returns. TSX abort is an indirect branch.", + "SampleAfterValue": "100003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct CALL branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "SampleAfterValue": "200003", + "UMask": "0xfb", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted indirect CALL retired.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", + "PublicDescription": "Counts retired mispredicted indirect (near t= aken) CALL instructions, including both register and memory indirect.", + "SampleAfterValue": "400009", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted indirect CALL retired. This prec= ise event may be used to get the misprediction cost via the Retire_Latency = field of PEBS. It fires on the instruction that immediately follows the mis= predicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST", + "SampleAfterValue": "400009", + "UMask": "0x42", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted near indirect branch instruction= s retired (excluding returns). This precise event may be used to get the mi= sprediction cost via the Retire_Latency field of PEBS. 
It fires on the inst= ruction that immediately follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_COST", + "SampleAfterValue": "100003", + "UMask": "0xc0", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near indire= ct JMP branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.INDIRECT_JMP", + "SampleAfterValue": "200003", + "UMask": "0xef", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of mispredicted near taken = branch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", + "SampleAfterValue": "200003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of near branch instructions retired th= at were mispredicted and taken.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", + "PublicDescription": "Counts number of near branch instructions re= tired that were mispredicted and taken.", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Mispredicted taken near branch instructions r= etired. This precise event may be used to get the misprediction cost via th= e Retire_Latency field of PEBS. It fires on the instruction that immediatel= y follows the mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.NEAR_TAKEN_COST", + "SampleAfterValue": "400009", + "UMask": "0x60", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This event counts the number of mispredicted = ret instructions retired. Non PEBS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET", + "PublicDescription": "This is a non-precise version (that is, does= not use PEBS) of the event that counts mispredicted return instructions re= tired.", + "SampleAfterValue": "100007", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of mispredicted near RET br= anch instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RETURN", + "SampleAfterValue": "200003", + "UMask": "0xf7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Mispredicted ret instructions retired. This p= recise event may be used to get the misprediction cost via the Retire_Laten= cy field of PEBS. 
It fires on the instruction that immediately follows the = mispredicted branch.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc5", + "EventName": "BR_MISP_RETIRED.RET_COST", + "SampleAfterValue": "100007", + "UMask": "0x48", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the total number of BTCLEARS.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe8", + "EventName": "BTCLEAR.ANY", + "PublicDescription": "Counts the total number of BTCLEARS which oc= curs when the Branch Target Buffer (BTB) predicts a taken branch.", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core clocks when the thread is in the C0.1 li= ght-weight slower wakeup time but more power saving optimized state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.C01", + "PublicDescription": "Counts core clocks when the thread is in the= C0.1 light-weight slower wakeup time but more power saving optimized state= . This state can be entered via the TPAUSE or UMWAIT instructions.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core clocks when the thread is in the C0.2 li= ght-weight faster wakeup time but less power saving optimized state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.C02", + "PublicDescription": "Counts core clocks when the thread is in the= C0.2 light-weight faster wakeup time but less power saving optimized state= . This state can be entered via the TPAUSE or UMWAIT instructions.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core clocks when the thread is in the C0.1 or= C0.2 or running a PAUSE in C0 ACPI state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.C0_WAIT", + "PublicDescription": "Counts core clocks when the thread is in the= C0.1 or C0.2 power saving optimized states (TPAUSE or UMWAIT instructions)= or running the PAUSE instruction.", + "SampleAfterValue": "2000003", + "UMask": "0x70", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.CORE", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core cycles when the core is not in a halt st= ate.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.CORE", + "PublicDescription": "Counts the number of core cycles while the c= ore is not in a halt state. The core enters the halt state when it is runni= ng the HLT instruction. This event is a component in many key event ratios.= The core frequency may change from time to time due to transitions associa= ted with Enhanced Intel SpeedStep Technology or TM2. For this reason this e= vent may have a changing ratio with regards to time. When the core frequenc= y is constant, this event can approximate elapsed time while the core was n= ot in the halt state. 
It is counted on a dedicated fixed counter, leaving t= he programmable counters available for other events.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.CORE_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.CORE_P", + "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core clocks when a PAUSE is pending.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.PAUSE", + "SampleAfterValue": "2000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Pause instructions", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xec", + "EventName": "CPU_CLK_UNHALTED.PAUSE_INST", + "SampleAfterValue": "2000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = reference clock cycles.", + "Counter": "Fixed counter 2", + "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "SampleAfterValue": "2000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Reference cycles when the core is not in halt= state.", + "Counter": "Fixed counter 2", + "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "PublicDescription": "Counts the number of reference cycles when t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction or the MWAIT instruction. This event is not affe= cted by core frequency changes (for example, P states, TM2 transitions) but= has the same incrementing frequency as the time stamp counter. This event = can approximate elapsed time while the core was not in a halt state. Note: = On all current platforms this event stops counting during 'throttling (TM)'= states duty off periods the processor is 'halted'. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. 
Software should ignore this case.", + "SampleAfterValue": "2000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of unhalted reference clock= cycles", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", + "PublicDescription": "Counts the number of reference cycles that t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction. This event is not affected by core frequency ch= anges and increments at a fixed frequency that is also used for the Time St= amp Counter (TSC). This event uses a programmable general purpose performan= ce counter.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Reference cycles when the core is not in halt= state.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", + "PublicDescription": "Counts the number of reference cycles when t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction or the MWAIT instruction. This event is not affe= cted by core frequency changes (for example, P states, TM2 transitions) but= has the same incrementing frequency as the time stamp counter. This event = can approximate elapsed time while the core was not in a halt state. Note: = On all current platforms this event stops counting during 'throttling (TM)'= states duty off periods the processor is 'halted'. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.THREAD", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core cycles when the thread is not in a halt = state.", + "Counter": "Fixed counter 1", + "EventName": "CPU_CLK_UNHALTED.THREAD", + "PublicDescription": "Counts the number of core cycles while the t= hread is not in a halt state. The thread enters the halt state when it is r= unning the HLT instruction. This event is a component in many key event rat= ios. The core frequency may change from time to time due to transitions ass= ociated with Enhanced Intel SpeedStep Technology or TM2. For this reason th= is event may have a changing ratio with regards to time. When the core freq= uency is constant, this event can approximate elapsed time while the core w= as not in the halt state. 
It is counted on a dedicated fixed counter, leavi= ng the programmable counters available for other events.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x3c", + "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles while memory subsystem has an outstand= ing load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "16", + "EventCode": "0xa3", + "EventName": "CYCLE_ACTIVITY.CYCLES_MEM_ANY", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total execution stalls.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "4", + "EventCode": "0xa3", + "EventName": "CYCLE_ACTIVITY.STALLS_TOTAL", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 1 uop is executed on all port= s and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.1_PORTS_UTIL", + "PublicDescription": "Counts cycles during which a total of 1 uop = was executed on all ports and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 2 or 3 uops are executed on a= ll ports and Reservation Station (RS) was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.2_3_PORTS_UTIL", + "SampleAfterValue": "2000003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 2 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.2_PORTS_UTIL", + "PublicDescription": "Counts cycles during which a total of 2 uops= were executed on all ports and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 3 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.3_PORTS_UTIL", + "PublicDescription": "Cycles total of 3 uops are executed on all p= orts and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles total of 4 uops are executed on all po= rts and Reservation Station was not empty.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + 
"EventName": "EXE_ACTIVITY.4_PORTS_UTIL", + "PublicDescription": "Cycles total of 4 uops are executed on all p= orts and Reservation Station (RS) was not empty.", + "SampleAfterValue": "2000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Execution stalls while memory subsystem has a= n outstanding load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "5", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.BOUND_ON_LOADS", + "SampleAfterValue": "2000003", + "UMask": "0x21", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles where the Store Buffer was full and no= loads caused an execution stall.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "2", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.BOUND_ON_STORES", + "PublicDescription": "Counts cycles where the Store Buffer was ful= l and no loads caused an execution stall.", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles no uop executed while RS was not empty= , the SB was not full and there was no outstanding load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa6", + "EventName": "EXE_ACTIVITY.EXE_BOUND_0_PORTS", + "PublicDescription": "Number of cycles total of 0 uops executed on= all ports, Reservation Station (RS) was not empty, the Store Buffer (SB) w= as not full and there was no outstanding load.", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Instruction decoders utilized in a cycle", + "Counter": "2", + "EventCode": "0x75", + "EventName": "INST_DECODED.DECODERS", + "PublicDescription": "Number of decoders utilized in a cycle when = the MITE (legacy decode pipeline) fetches instructions.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired.", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.ANY", + "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of instructions retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of instructions retired. General Count= er - architectural event", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.ANY_P", + "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", + "SampleAfterValue": "2000003", + "Unit": "cpu_core" + }, + { + "BriefDescription": "retired macro-fused uops when there is a bran= ch in the macro-fused pair (the two instructions that got macro-fused count= once in this pmon)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.BR_FUSED", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INST_RETIRED.MACRO_FUSED", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.MACRO_FUSED", + "SampleAfterValue": "2000003", + "UMask": "0x30", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Retired NOP instructions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.NOP", + "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREF= ETCHIT0/1 instructions", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Precise instruction retired with PEBS precise= -distribution", + "Counter": "Fixed counter 0", + "EventName": "INST_RETIRED.PREC_DIST", + "PublicDescription": "A version of INST_RETIRED that allows for a = precise distribution of samples across instructions retired. It utilizes th= e Precise Distribution of Instructions Retired (PDIR++) feature to fix bias= in how retired instructions get sampled. Use on Fixed Counter 0.", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Iterations of Repeat string retired instructi= ons.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc0", + "EventName": "INST_RETIRED.REP_ITERATION", + "PublicDescription": "Number of iterations of Repeat (REP) string = retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, a= nd doubleword version and string instructions can be repeated using a repet= ition prefix, REP, that allows their architectural execution to be repeated= a number of times as specified by the RCX register. 
Note the number of ite= rations is implementation-dependent.", + "SampleAfterValue": "2000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Bubble cycles of BPClear.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.BPCLEAR_CYCLES", + "MSRIndex": "0x3F7", + "MSRValue": "0xB", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Clears speculative count", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEARS_COUNT", + "PublicDescription": "Counts the number of speculative clears due = to any type of branch misprediction or machine clears", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts cycles after recovery from a branch mi= sprediction or machine clear till the first uop is issued from the resteere= d path.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.CLEAR_RESTEER_CYCLES", + "PublicDescription": "Cycles after recovery from a branch mispredi= ction or machine clear till the first uop is issued from the resteered path= .", + "SampleAfterValue": "500009", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Core cycles the allocator was stalled due to = recovery from earlier clear event for this thread", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.RECOVERY_CYCLES", + "PublicDescription": "Counts core cycles when the Resource allocat= or was stalled due to recovery from an earlier branch misprediction or mach= ine clear event.", + "SampleAfterValue": "500009", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Bubble cycles of BAClear (Unknown Branch).", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.UNKNOWN_BRANCH_CYCLES", + "MSRIndex": "0x3F7", + "MSRValue": "0x7", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots where uops got dropped", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xad", + "EventName": "INT_MISC.UOP_DROPPING", + "PublicDescription": "Estimated number of Top-down Microarchitectu= re Analysis slots that got dropped due to non front-end reasons", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of vector integer instructions retired= of 128-bit vector-width.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.128BIT", + "SampleAfterValue": "1000003", + "UMask": "0x13", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of vector integer instructions retired= of 256-bit vector-width.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.256BIT", + "SampleAfterValue": "1000003", + "UMask": "0xac", + "Unit": "cpu_core" + }, + { + "BriefDescription": "integer ADD, SUB, SAD 128-bit vector instruct= ions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.ADD_128", + "PublicDescription": "Number of retired integer ADD/SUB (regular o= r horizontal), SAD 128-bit vector instructions.", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_core" + }, + { + "BriefDescription": "integer ADD, SUB, SAD 256-bit vector instruct= ions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": 
"INT_VEC_RETIRED.ADD_256", + "PublicDescription": "Number of retired integer ADD/SUB (regular o= r horizontal), SAD 256-bit vector instructions.", + "SampleAfterValue": "1000003", + "UMask": "0xc", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.MUL_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.MUL_256", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.SHUFFLES", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.SHUFFLES", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_128", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_128", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "INT_VEC_RETIRED.VNNI_256", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe7", + "EventName": "INT_VEC_RETIRED.VNNI_256", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of retired loads that are b= locked because it initially appears to be store forward blocked, but subseq= uently is shown not to be blocked based on 4K alias check.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "False dependencies in MOB due to partial comp= are on address.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ADDRESS_ALIAS", + "PublicDescription": "Counts the number of times a load got blocke= d due to false dependencies in MOB due to partial compare on address.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad was blocked for any of the following reasons: utlb_miss, 4k_alias, unkn= own_sta/bad_fwd, unready_fwd (includes md blocks and esp consuming load blo= cks)", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address exactly matches an older store whose da= ta is not ready (a.k.a. unknown). 
unready_fwd", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DATA_UNKNOWN", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "The number of times that split load operation= s are temporarily blocked because all resources for handling the split acce= sses are in use.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.NO_SR", + "PublicDescription": "Counts the number of times that split load o= perations are temporarily blocked because all resources for handling the sp= lit accesses are in use.", + "SampleAfterValue": "100003", + "UMask": "0x88", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address partially overlaps with an older store = (size mismatch) - unknown_sta/bad_forward", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Loads blocked due to overlapping with a prece= ding store that cannot be forwarded.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.STORE_FORWARD", + "PublicDescription": "Counts the number of times where store forwa= rding was prevented for a load operation. The most common case is a load bl= ocked due to the address of memory access (partially) overlapping with a pr= eceding uncompleted store. Note: See the table of not supported store forwa= rds in the Optimization Guide.", + "SampleAfterValue": "100003", + "UMask": "0x82", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of demand loads that match = on a wcb (request buffer) allocated by an L1 hardware prefetch", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x4c", + "EventName": "LOAD_HIT_PREFETCH.HW_PF", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Uops delivered by the LSD, but didn't = come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_ACTIVE", + "PublicDescription": "Counts the cycles when at least one uop is d= elivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles optimal number of Uops delivered by th= e LSD, but did not come from the decoder.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "8", + "EventCode": "0xa8", + "EventName": "LSD.CYCLES_OK", + "PublicDescription": "Counts the cycles when optimal number of uop= s is delivered by the LSD (Loop-stream detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Number of Uops delivered by the LSD.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa8", + "EventName": "LSD.UOPS", + "PublicDescription": "Counts the number of uops delivered to the b= ack-end by the LSD(Loop Stream Detector).", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts all machine clears for any reason incl= uding, but not limited to memory ordering, SMC, and FP assist.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.ANY", + "SampleAfterValue": "20003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h 
the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.ANY_FAST", + "SampleAfterValue": "20003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of machine clears (nukes) of any type.= ", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.COUNT", + "PublicDescription": "Counts the number of machine clears (nukes) = of any type.", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of memory ordering machine = clears triggered due to an internal load passing an older store within the = same CPU.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION", + "SampleAfterValue": "20003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.DISAMBIGUATION_FAST", + "SampleAfterValue": "20003", + "UMask": "0x88", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of nukes due to memory rena= ming", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MRN_NUKE", + "SampleAfterValue": "20003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine without the use of microcode.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.MRN_NUKE_FAST", + "SampleAfterValue": "20003", + "UMask": "0x90", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of times that the machine c= lears due to a page fault. Covers both I-Side and D-Side (Loads/Stores) pa= ge faults. A page fault occurs when either the page is not present, or an = access violation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.PAGE_FAULT", + "SampleAfterValue": "20003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears that flus= h the pipeline and restart the machine with the use of microcode due to SMC= , MEMORY_ORDERING, FP_ASSISTS, PAGE_FAULT, DISAMBIGUATION, and FPC_VIRTUAL_= TRAP.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SLOW", + "SampleAfterValue": "20003", + "UMask": "0x6e", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Self-modifying code (SMC) detected.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc3", + "EventName": "MACHINE_CLEARS.SMC", + "PublicDescription": "Counts self-modifying code (SMC) detected, w= hich causes a machine clear.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "LFENCE instructions retired", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe0", + "EventName": "MISC2_RETIRED.LFENCE", + "PublicDescription": "number of LFENCE retired instructions", + "SampleAfterValue": "400009", + "UMask": "0x20", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of LBR entries recorded. 
Re= quires LBRs to be enabled in IA32_LBR_CTL.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe4", + "EventName": "MISC_RETIRED.LBR_INSERTS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "LBR record is inserted", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xe4", + "EventName": "MISC_RETIRED.LBR_INSERTS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of CLFLUSH, CLWB, and CLDEM= OTE instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.CL_INST", + "SampleAfterValue": "1000003", + "UMask": "0xff", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of LFENCE instructions reti= red.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.LFENCE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of RDPMC, RDTSC, and RDTSCP= instructions retired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.RDPMC_RDTSC_P", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Count the number of WRMSR instructions retire= d.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe0", + "EventName": "MISC_RETIRED1.WRMSR", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of faults and software inte= rrupts with vector < 32.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.FAULT_ALL", + "PublicDescription": "Counts the number of faults and software int= errupts with vector < 32, including VOE cases.", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of PSB+ nuke events and ToP= A trap events.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.INTEL_PT_CLEARS", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of accesses to KeyLocker ca= che.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.KEYLOCKER_ACCESS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of misses to KeyLocker cach= e.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.KEYLOCKER_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x11", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of user interrupts delivere= d.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.ULI_DELIVERY", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of SENDUIPI instructions re= tired.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.ULI_SENDUIPI", + "SampleAfterValue": "1000003", + "UMask": "0x9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of VM exits.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe1", + "EventName": "MISC_RETIRED2.VM_EXIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots not consumed= by the backend due to a micro-sequencer (MS) scoreboard, which stalls the 
= front-end from issuing from the UROM until a specified older uop retires.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x75", + "EventName": "SERIALIZATION.NON_C01_MS_SCB", + "PublicDescription": "Counts the number of issue slots not consume= d by the backend due to a micro-sequencer (MS) scoreboard, which stalls the= front-end from issuing from the UROM until a specified older uop retires. = The most commonly executed instruction with an MS scoreboard is PAUSE.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were not consumed by the back-end pipeline due to lack of bac= k-end resources, as a result of memory subsystem delays, execution units li= mitations, or other conditions.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS", + "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that were not consumed by the back-end pipeline due to lack of ba= ck-end resources, as a result of memory subsystem delays, execution units l= imitations, or other conditions. Software can use this event as the numerat= or for the Backend Bound metric (or top-level category) of the Top-down Mic= roarchitecture Analysis method.", + "SampleAfterValue": "10000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots wasted due to incorrect speculation= s.", + "Counter": "0", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BAD_SPEC_SLOTS", + "PublicDescription": "Number of slots of TMA method that were wast= ed due to incorrect speculation. It covers all types of control-flow or dat= a-related mis-speculations.", + "SampleAfterValue": "10000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots wasted due to incorrect speculation= by branch mispredictions", + "Counter": "0", + "EventCode": "0xa4", + "EventName": "TOPDOWN.BR_MISPREDICT_SLOTS", + "PublicDescription": "Number of TMA slots that were wasted due to = incorrect speculation by (any type of) branch mispredictions. This event es= timates number of speculative operations that were issued but not retired a= s well as the out-of-order engine recovery past a branch misprediction.", + "SampleAfterValue": "10000003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TOPDOWN.MEMORY_BOUND_SLOTS", + "Counter": "3", + "EventCode": "0xa4", + "EventName": "TOPDOWN.MEMORY_BOUND_SLOTS", + "SampleAfterValue": "10000003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. Fixed counter - architectural event", + "Counter": "Fixed counter 3", + "EventName": "TOPDOWN.SLOTS", + "PublicDescription": "Number of available slots for an unhalted lo= gical processor. The event increments by machine-width of the narrowest pip= eline as employed by the Top-down Microarchitecture Analysis method (TMA). = Software can use this event as the denominator for the top-level metrics of= the TMA method. This architectural event is counted on a designated fixed = counter (Fixed Counter 3).", + "SampleAfterValue": "10000003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. 
General counter - architectural event", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xa4", + "EventName": "TOPDOWN.SLOTS_P", + "PublicDescription": "Counts the number of available slots for an = unhalted logical processor. The event increments by machine-width of the na= rrowest pipeline as employed by the Top-down Microarchitecture Analysis met= hod.", + "SampleAfterValue": "10000003", + "UMask": "0x1", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of issue slo= ts not consumed by the backend because allocation is stalled due to a mispr= edicted jump or a machine clear.", + "Counter": "36", + "EventName": "TOPDOWN_BAD_SPECULATION.ALL", + "PublicDescription": "Fixed Counter: Counts the number of issue sl= ots that were not consumed by the backend because allocation is stalled due= to a mispredicted jump or a machine clear. Counts all issue slots blocked= during this recovery window including relevant microcode flows and while u= ops are not yet available in the IQ. Also, includes the issue slots that we= re consumed by the backend but were thrown away because they were younger t= han the mispredict or machine clear.", + "SampleAfterValue": "1000003", + "UMask": "0x5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.ALL_P", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Fast Nukes such as Memory Ord= ering Machine clears and MRN nukes", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.FASTNUKE", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS", + "SampleAfterValue": "1000003", + "UMask": "0x3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to Branch Mispredict", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.MISPREDICT", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to a machine clear (nuke).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x73", + "EventName": "TOPDOWN_BAD_SPECULATION.NUKE", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL_P]= ", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to due to certain allocation rest= rictions", + "Counter": 
"0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.ALL_NON_ARCH", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL]", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xa4", + "EventName": "TOPDOWN_BE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to memory reservation stall (sche= duler not being able to accept another uop). This could be caused by RSV f= ull or load/store buffer block.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to IEC and FPC RAT stalls - which= can be due to the FIQ and IEC reservation station stall (integer, FP and S= IMD scheduler not being able to accept another uop. )", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to mrbl stall. 
A 'marble' refers= to a physical register file entry, also known as the physical destination = (PDST).", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REGISTER", + "SampleAfterValue": "1000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to ROB full", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.REORDER_BUFFER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not consumed by the backend due to iq/jeu scoreboards or ms scb", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x74", + "EventName": "TOPDOWN_BE_BOUND.SERIALIZATION", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of retiremen= t slots not consumed due to front end stalls.", + "Counter": "37", + "EventName": "TOPDOWN_FE_BOUND.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x6", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ALL_NON_ARCH", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x9c", + "EventName": "TOPDOWN_FE_BOUND.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BAClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_DETECT", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to BTClear", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.BRANCH_RESTEER", + "SampleAfterValue": "1000003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to ms", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.CISC", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to decode stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.DECODE", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to frontend bandwidth restricti= ons due to decode, predecode, cisc, and other limitations.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH", + "SampleAfterValue": "1000003", + "UMask": "0x8d", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to latency related stalls inclu= ding BACLEARs, BTCLEARs, ITLB misses, 
and ICache misses.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY", + "SampleAfterValue": "1000003", + "UMask": "0x72", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to itlb miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.ITLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend that do not categorize into any oth= er common frontend stall", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.OTHER", + "SampleAfterValue": "1000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots every cycle = that were not delivered by the frontend due to predecode wrong", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x71", + "EventName": "TOPDOWN_FE_BOUND.PREDECODE", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fixed Counter: Counts the number of consumed = retirement slots.", + "Counter": "38", + "EventName": "TOPDOWN_RETIRING.ALL", + "SampleAfterValue": "1000003", + "UMask": "0x7", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of consumed retirement slot= s.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x72", + "EventName": "TOPDOWN_RETIRING.ALL_NON_ARCH", + "SampleAfterValue": "1000003", + "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of mispredicted branc= h instructions retired for all branch types.", + "BriefDescription": "Counts the number of consumed retirement slot= s.", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xc5", - "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", - "PublicDescription": "Counts the total number of mispredicted bran= ch instructions retired. All branch type instructions are accounted for. = Prediction of the branch target address enables the processor to begin exec= uting instructions before the non-speculative execution path is known. The = branch prediction unit (BPU) predicts the target address based on the instr= uction pointer (IP) of the branch and on the execution path through which e= xecution reached this IP. A branch misprediction occurs when the predict= ion is wrong, and results in discarding all instructions executed in the sp= eculative path and re-fetching from the correct path.", - "SampleAfterValue": "200003", + "EventCode": "0xc2", + "EventName": "TOPDOWN_RETIRING.ALL_P", + "SampleAfterValue": "1000003", + "UMask": "0x2", "Unit": "cpu_atom" }, { - "BriefDescription": "All mispredicted branch instructions retired.= ", + "BriefDescription": "Number of non dec-by-all uops decoded by deco= der", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xc5", - "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", - "PublicDescription": "Counts all the retired branch instructions t= hat were mispredicted by the processor. A branch misprediction occurs when = the processor incorrectly predicts the destination of the branch. 
When the= misprediction is discovered at execution, all the instructions executed in= the wrong (speculative) path must be discarded, and the processor must sta= rt fetching from the correct path.", - "SampleAfterValue": "400009", + "EventCode": "0x76", + "EventName": "UOPS_DECODED.DEC0_UOPS", + "PublicDescription": "This event counts the number of not dec-by-a= ll uops decoded by decoder 0.", + "SampleAfterValue": "1000003", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.CORE", - "SampleAfterValue": "2000003", - "UMask": "0x2", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Core cycles when the core is not in a halt st= ate.", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.CORE", - "PublicDescription": "Counts the number of core cycles while the c= ore is not in a halt state. The core enters the halt state when it is runni= ng the HLT instruction. This event is a component in many key event ratios.= The core frequency may change from time to time due to transitions associa= ted with Enhanced Intel SpeedStep Technology or TM2. For this reason this e= vent may have a changing ratio with regards to time. When the core frequenc= y is constant, this event can approximate elapsed time while the core was n= ot in the halt state. It is counted on a dedicated fixed counter, leaving t= he programmable counters available for other events.", + "BriefDescription": "Uops executed on INT EU ALU ports.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.ALU", + "PublicDescription": "Number of ALU integer uops dispatch to execu= tion.", "SampleAfterValue": "2000003", "UMask": "0x2", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.CORE_P", + "BriefDescription": "Uops executed on any INT EU ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.INT_EU_ALL", + "PublicDescription": "Number of integer uops dispatched to executi= on.", "SampleAfterValue": "2000003", - "Unit": "cpu_atom" + "UMask": "0x1", + "Unit": "cpu_core" }, { - "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "BriefDescription": "Number of Uops dispatched/executed by any of = the 3 JEUs (all ups that hold the JEU including macro; micro jumps; fetch-f= rom-eip)", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.CORE_P", - "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. 
[This event is alias to CPU_CLK_UNHALTED.THREAD_P]", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.JMP", + "PublicDescription": "Number of jump uops dispatch to execution", "SampleAfterValue": "2000003", + "UMask": "0x40", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of unhalted = reference clock cycles", - "Counter": "Fixed counter 2", - "EventName": "CPU_CLK_UNHALTED.REF_TSC", + "BriefDescription": "Uops executed on Load ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.LOAD", + "PublicDescription": "Number of Load uops dispatched to execution.= ", "SampleAfterValue": "2000003", - "UMask": "0x3", - "Unit": "cpu_atom" + "UMask": "0x4", + "Unit": "cpu_core" }, { - "BriefDescription": "Reference cycles when the core is not in halt= state.", - "Counter": "Fixed counter 2", - "EventName": "CPU_CLK_UNHALTED.REF_TSC", - "PublicDescription": "Counts the number of reference cycles when t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction or the MWAIT instruction. This event is not affe= cted by core frequency changes (for example, P states, TM2 transitions) but= has the same incrementing frequency as the time stamp counter. This event = can approximate elapsed time while the core was not in a halt state. Note: = On all current platforms this event stops counting during 'throttling (TM)'= states duty off periods the processor is 'halted'. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", + "BriefDescription": "Number of (shift) 1-cycle Uops dispatched/exe= cuted by any of the Shift Eus", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SHIFT", + "PublicDescription": "Number of SHIFT integer uops dispatch to exe= cution", "SampleAfterValue": "2000003", - "UMask": "0x3", + "UMask": "0x20", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of unhalted reference clock= cycles", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", - "PublicDescription": "Counts the number of reference cycles that t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction. This event is not affected by core frequency ch= anges and increments at a fixed frequency that is also used for the Time St= amp Counter (TSC). This event uses a programmable general purpose performan= ce counter.", + "BriefDescription": "Number of Uops dispatched/executed by Slow EU= (e.g. 
3+ cycles LEA, >1 cycles shift, iDIVs, CR; *H operation)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.SLOW", + "PublicDescription": "Number of Slow integer uops dispatch to exec= ution.", "SampleAfterValue": "2000003", - "UMask": "0x1", - "Unit": "cpu_atom" + "UMask": "0x8", + "Unit": "cpu_core" }, { - "BriefDescription": "Reference cycles when the core is not in halt= state.", + "BriefDescription": "Number of Uops dispatched on STA ports", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.REF_TSC_P", - "PublicDescription": "Counts the number of reference cycles when t= he core is not in a halt state. The core enters the halt state when it is r= unning the HLT instruction or the MWAIT instruction. This event is not affe= cted by core frequency changes (for example, P states, TM2 transitions) but= has the same incrementing frequency as the time stamp counter. This event = can approximate elapsed time while the core was not in a halt state. Note: = On all current platforms this event stops counting during 'throttling (TM)'= states duty off periods the processor is 'halted'. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STA", + "PublicDescription": "Number of STA (Store Address) uops dispatch = to execution", "SampleAfterValue": "2000003", - "UMask": "0x1", + "UMask": "0x80", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of unhalted = core clock cycles", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.THREAD", + "BriefDescription": "Uops executed on STD ports", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xb2", + "EventName": "UOPS_DISPATCHED.STD", + "PublicDescription": "Number of STD (Store Data) uops dispatch to = execution", "SampleAfterValue": "2000003", - "UMask": "0x2", - "Unit": "cpu_atom" + "UMask": "0x10", + "Unit": "cpu_core" }, { - "BriefDescription": "Core cycles when the thread is not in a halt = state.", - "Counter": "Fixed counter 1", - "EventName": "CPU_CLK_UNHALTED.THREAD", - "PublicDescription": "Counts the number of core cycles while the t= hread is not in a halt state. The thread enters the halt state when it is r= unning the HLT instruction. This event is a component in many key event rat= ios. The core frequency may change from time to time due to transitions ass= ociated with Enhanced Intel SpeedStep Technology or TM2. For this reason th= is event may have a changing ratio with regards to time. When the core freq= uency is constant, this event can approximate elapsed time while the core w= as not in the halt state. 
It is counted on a dedicated fixed counter, leavi= ng the programmable counters available for other events.", + "BriefDescription": "Cycles where at least 1 uop was executed per-= thread", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_1", + "PublicDescription": "Cycles where at least 1 uop was executed per= -thread.", "SampleAfterValue": "2000003", - "UMask": "0x2", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of unhalted core clock cycl= es [This event is alias to CPU_CLK_UNHALTED.CORE_P]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.THREAD_P", + "BriefDescription": "Cycles where at least 2 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "2", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_2", + "PublicDescription": "Cycles where at least 2 uops were executed p= er-thread.", "SampleAfterValue": "2000003", - "Unit": "cpu_atom" + "UMask": "0x1", + "Unit": "cpu_core" }, { - "BriefDescription": "Thread cycles when thread is not in halt stat= e [This event is alias to CPU_CLK_UNHALTED.CORE_P]", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x3c", - "EventName": "CPU_CLK_UNHALTED.THREAD_P", - "PublicDescription": "This is an architectural event that counts t= he number of thread cycles while the thread is not in a halt state. The thr= ead enters the halt state when it is running the HLT instruction. The core = frequency may change from time to time due to power or thermal throttling. = For this reason, this event may have a changing ratio with regards to wall = clock time. [This event is alias to CPU_CLK_UNHALTED.CORE_P]", + "BriefDescription": "Cycles where at least 3 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_3", + "PublicDescription": "Cycles where at least 3 uops were executed p= er-thread.", "SampleAfterValue": "2000003", + "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired", - "Counter": "Fixed counter 0", - "EventName": "INST_RETIRED.ANY", - "PEBS": "1", + "BriefDescription": "Cycles where at least 4 uops were executed pe= r-thread", + "Counter": "3", + "CounterMask": "4", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.CYCLES_GE_4", + "PublicDescription": "Cycles where at least 4 uops were executed p= er-thread.", "SampleAfterValue": "2000003", "UMask": "0x1", - "Unit": "cpu_atom" + "Unit": "cpu_core" }, { - "BriefDescription": "Number of instructions retired. Fixed Counter= - architectural event", - "Counter": "Fixed counter 0", - "EventName": "INST_RETIRED.ANY", - "PEBS": "1", - "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. 
INST_RETIRED.ANY_P is counted by a programmable counter.", + "BriefDescription": "Counts number of cycles no uops were dispatch= ed to be executed on this thread.", + "Counter": "3", + "CounterMask": "1", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.STALLS", + "Invert": "1", + "PublicDescription": "Counts cycles during which no uops were disp= atched from the Reservation Station (RS) per thread.", "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of instructions retired", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xc0", - "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", + "BriefDescription": "Counts the number of uops to be executed per-= thread each cycle.", + "Counter": "3", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.THREAD", "SampleAfterValue": "2000003", - "Unit": "cpu_atom" + "UMask": "0x1", + "Unit": "cpu_core" }, { - "BriefDescription": "Number of instructions retired. General Count= er - architectural event", + "BriefDescription": "Counts the number of x87 uops dispatched.", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xc0", - "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", - "PublicDescription": "Counts the number of X86 instructions retire= d - an Architectural PerfMon event. Counting continues during hardware inte= rrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is co= unted by a designated fixed counter freeing up programmable counters to cou= nt other events. INST_RETIRED.ANY_P is counted by a programmable counter.", + "EventCode": "0xb1", + "EventName": "UOPS_EXECUTED.X87", + "PublicDescription": "Counts the number of x87 uops executed.", "SampleAfterValue": "2000003", + "UMask": "0x10", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of occurrences a retired lo= ad gets blocked because its address partially overlaps with an older store = (size mismatch) - unknown_sta/bad_forward", + "BriefDescription": "When 4-uops are requested and only 2-uops are= delivered, the event counts 2. Uops_issued correlates to the number of RO= B entries. If uop takes 2 ROB slots it counts as 2 uops_issued.", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x03", - "EventName": "LD_BLOCKS.STORE_FORWARD", - "PEBS": "1", - "SampleAfterValue": "1000003", - "UMask": "0x2", + "EventCode": "0x0e", + "EventName": "UOPS_ISSUED.ANY", + "SampleAfterValue": "200003", "Unit": "cpu_atom" }, { - "BriefDescription": "Loads blocked due to overlapping with a prece= ding store that cannot be forwarded.", + "BriefDescription": "Uops that RAT issues to RS", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0x03", - "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "Counts the number of times where store forwa= rding was prevented for a load operation. The most common case is a load bl= ocked due to the address of memory access (partially) overlapping with a pr= eceding uncompleted store. Note: See the table of not supported store forwa= rds in the Optimization Guide.", - "SampleAfterValue": "100003", - "UMask": "0x82", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Counts the number of LBR entries recorded. 
Re= quires LBRs to be enabled in IA32_LBR_CTL.", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xe4", - "EventName": "MISC_RETIRED.LBR_INSERTS", - "PEBS": "1", - "SampleAfterValue": "1000003", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.ANY", + "PublicDescription": "Counts the number of uops that the Resource = Allocation Table (RAT) issues to the Reservation Station (RS).", + "SampleAfterValue": "2000003", "UMask": "0x1", - "Unit": "cpu_atom" + "Unit": "cpu_core" }, { - "BriefDescription": "LBR record is inserted", + "BriefDescription": "UOPS_ISSUED.CYCLES", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xe4", - "EventName": "MISC_RETIRED.LBR_INSERTS", - "PEBS": "1", - "SampleAfterValue": "1000003", + "CounterMask": "1", + "EventCode": "0xae", + "EventName": "UOPS_ISSUED.CYCLES", + "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that were not consumed by the back-end pipeline due to lack of bac= k-end resources, as a result of memory subsystem delays, execution units li= mitations, or other conditions.", - "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa4", - "EventName": "TOPDOWN.BACKEND_BOUND_SLOTS", - "PublicDescription": "This event counts a subset of the Topdown Sl= ots event that were not consumed by the back-end pipeline due to lack of ba= ck-end resources, as a result of memory subsystem delays, execution units l= imitations, or other conditions. Software can use this event as the numerat= or for the Backend Bound metric (or top-level category) of the Top-down Mic= roarchitecture Analysis method.", - "SampleAfterValue": "10000003", - "UMask": "0x2", - "Unit": "cpu_core" + "BriefDescription": "Counts the number of uops retired", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.ALL", + "SampleAfterValue": "2000003", + "Unit": "cpu_atom" }, { - "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. Fixed counter - architectural event", - "Counter": "Fixed counter 3", - "EventName": "TOPDOWN.SLOTS", - "PublicDescription": "Number of available slots for an unhalted lo= gical processor. The event increments by machine-width of the narrowest pip= eline as employed by the Top-down Microarchitecture Analysis method (TMA). = Software can use this event as the denominator for the top-level metrics of= the TMA method. This architectural event is counted on a designated fixed = counter (Fixed Counter 3).", - "SampleAfterValue": "10000003", - "UMask": "0x4", + "BriefDescription": "Cycles with retired uop(s).", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.CYCLES", + "PublicDescription": "Counts cycles where at least one uop has ret= ired.", + "SampleAfterValue": "1000003", + "UMask": "0x2", "Unit": "cpu_core" }, { - "BriefDescription": "TMA slots available for an unhalted logical p= rocessor. General counter - architectural event", + "BriefDescription": "Retired uops except the last uop of each inst= ruction.", "Counter": "0,1,2,3,4,5,6,7,8,9", - "EventCode": "0xa4", - "EventName": "TOPDOWN.SLOTS_P", - "PublicDescription": "Counts the number of available slots for an = unhalted logical processor. 
The event increments by machine-width of the na= rrowest pipeline as employed by the Top-down Microarchitecture Analysis met= hod.", - "SampleAfterValue": "10000003", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.HEAVY", + "PublicDescription": "Counts the number of retired micro-operation= s (uops) except the last uop of each instruction. An instruction that is de= coded into less than two uops does not contribute to the count.", + "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear. [This event is alias to TOPDOWN_BAD_SPECULATION= .ALL_P]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x73", - "EventName": "TOPDOWN_BAD_SPECULATION.ALL", - "SampleAfterValue": "1000003", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend because allocation is stalled due to a mispredict= ed jump or a machine clear. [This event is alias to TOPDOWN_BAD_SPECULATION= .ALL]", - "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x73", - "EventName": "TOPDOWN_BAD_SPECULATION.ALL_P", - "SampleAfterValue": "1000003", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL_P]= ", + "BriefDescription": "Counts the number of integer divide uops reti= red", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa4", - "EventName": "TOPDOWN_BE_BOUND.ALL", - "SampleAfterValue": "1000003", - "UMask": "0x2", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.IDIV", + "SampleAfterValue": "2000003", + "UMask": "0x10", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of retirement slots not con= sumed due to backend stalls [This event is alias to TOPDOWN_BE_BOUND.ALL]", + "BriefDescription": "Counts the number of uops retired that were d= elivered by the loop stream detector (LSD).", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0xa4", - "EventName": "TOPDOWN_BE_BOUND.ALL_P", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_atom" - }, - { - "BriefDescription": "Fixed Counter: Counts the number of retiremen= t slots not consumed due to front end stalls", - "Counter": "37", - "EventName": "TOPDOWN_FE_BOUND.ALL", - "SampleAfterValue": "1000003", - "UMask": "0x6", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.LSD", + "SampleAfterValue": "2000003", + "UMask": "0x4", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of retirement slots not con= sumed due to front end stalls", + "BriefDescription": "Counts the number of uops that are from the c= omplex flows issued by the micro-sequencer (MS). 
This includes uops from f= lows due to complex instructions, faults, assists, and inserted flows.", "Counter": "0,1,2,3,4,5,6,7", - "EventCode": "0x9c", - "EventName": "TOPDOWN_FE_BOUND.ALL_P", - "SampleAfterValue": "1000003", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "SampleAfterValue": "2000003", "UMask": "0x1", "Unit": "cpu_atom" }, { - "BriefDescription": "Fixed Counter: Counts the number of consumed = retirement slots.", - "Counter": "38", - "EventName": "TOPDOWN_RETIRING.ALL", - "PEBS": "1", - "SampleAfterValue": "1000003", - "UMask": "0x7", - "Unit": "cpu_atom" + "BriefDescription": "UOPS_RETIRED.MS", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.MS", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" }, { - "BriefDescription": "Counts the number of consumed retirement slot= s.", - "Counter": "0,1,2,3,4,5,6,7", + "BriefDescription": "Number of non-speculative switches to the Mic= rocode Sequencer (MS)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EdgeDetect": "1", "EventCode": "0xc2", - "EventName": "TOPDOWN_RETIRING.ALL_P", - "PEBS": "1", - "SampleAfterValue": "1000003", - "UMask": "0x2", - "Unit": "cpu_atom" + "EventName": "UOPS_RETIRED.MS_SWITCHES", + "MSRIndex": "0x3F7", + "MSRValue": "0x8", + "PublicDescription": "Switches to the Microcode Sequencer", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_core" }, { "BriefDescription": "This event counts a subset of the Topdown Slo= ts event that are utilized by operations that eventually get retired (commi= tted) by the processor pipeline. Usually, this event positively correlates = with higher performance for example, as measured by the instructions-per-c= ycle metric.", @@ -330,5 +2007,26 @@ "SampleAfterValue": "2000003", "UMask": "0x2", "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles without actually retired uops.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.STALLS", + "Invert": "1", + "PublicDescription": "This event counts cycles without actually re= tired uops.", + "SampleAfterValue": "1000003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of x87 uops retired, includ= es those in ms flows", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xc2", + "EventName": "UOPS_RETIRED.X87", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json b/= tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json new file mode 100644 index 000000000000..9046b680649d --- /dev/null +++ b/tools/perf/pmu-events/arch/x86/lunarlake/uncore-memory.json @@ -0,0 +1,18 @@ +[ + { + "BriefDescription": "Read CAS command sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x22", + "EventName": "UNC_M_CAS_COUNT_RD", + "PerPkg": "1", + "Unit": "iMC" + }, + { + "BriefDescription": "Write CAS command sent to DRAM", + "Counter": "0,1,2,3,4", + "EventCode": "0x23", + "EventName": "UNC_M_CAS_COUNT_WR", + "PerPkg": "1", + "Unit": "iMC" + } +] diff --git a/tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json index 59af79e3466e..defa3a967754 100644 --- a/tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/lunarlake/virtual-memory.json @@ -1,4 +1,70 @@ [ + { + "BriefDescription": 
"Counts the number of page walks initiated by = a demand load that missed the first and second level TLBs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.MISS_CAUSED_WALK", + "SampleAfterValue": "200003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts walks that miss the PDE_CACHE", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.PDE_CACHE_MISS", + "SampleAfterValue": "200003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccounts for all page sizes. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Loads that miss the DTLB and hit the STLB.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT", + "PublicDescription": "Counts loads that miss the DTLB (Data TLB) a= nd hit the STLB (Second level TLB).", + "SampleAfterValue": "100003", + "UMask": "0x320", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccount for 4k page size only. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT_4K", + "SampleAfterValue": "200003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to a demand load that did not start a page walk. A= ccount for large page sizes only. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.STLB_HIT_LGPG", + "SampleAfterValue": "200003", + "UMask": "0x40", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for a demand load.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a demand load.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to any page size.", "Counter": "0,1,2,3,4,5,6,7", @@ -19,6 +85,125 @@ "UMask": "0xe", "Unit": "cpu_core" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pa= ges. Includes page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M sizes) c= aused by demand data loads. This implies address translations missed in the= DTLB and further levels of TLB. The page walk can end with or without a fa= ult.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to load DTLB misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to loads (including SW prefetches) whose address translations missed in a= ll Translation Lookaside Buffer (TLB) levels and were mapped to 4K pages. I= ncludes page walks that page fault.", + "SampleAfterValue": "200003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks outstanding f= or Loads (demand or SW prefetch) in PMH every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for Loads (demand or SW prefetch) in PMH every cycle. A PMH page walk is o= utstanding from page walk start till PMH becomes idle again (ready to serve= next walk). 
Includes EPT-walk intervals.", + "SampleAfterValue": "200003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of page walks outstanding for a demand= load in the PMH each cycle.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x12", + "EventName": "DTLB_LOAD_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for a demand load in the PMH (Page Miss Handler) each cycle.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks initiated by = a store that missed the first and second level TLBs.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.MISS_CAUSED_WALK", + "SampleAfterValue": "2000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts walks that miss the PDE_CACHE", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.PDE_CACHE_MISS", + "SampleAfterValue": "2000003", + "UMask": "0x80", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of first level TLB misses b= ut second level hits due to stores that did not start a page walk. Accounts= for all page sizes. Will result in a DTLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.STLB_HIT", + "PublicDescription": "Counts the number of first level TLB misses = but second level hits due to a demand load that did not start a page walk. = Accounts for all page sizes. Will result in a DTLB write from STLB.", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Stores that miss the DTLB and hit the STLB.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.STLB_HIT", + "PublicDescription": "Counts stores that miss the DTLB (Data TLB) = and hit the STLB (2nd Level TLB).", + "SampleAfterValue": "100003", + "UMask": "0x320", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for a store.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a store.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to any page size.", "Counter": "0,1,2,3,4,5,6,7", @@ -39,6 +224,134 @@ "UMask": "0xe", "Unit": "cpu_core" }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 1G page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. 
The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x8", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to 2M or 4M pages. Includes page walks= that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M sizes) c= aused by demand data stores. This implies address translations missed in th= e DTLB and further levels of TLB. The page walk can end with or without a f= ault.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to store DTLB misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to stores whose address translations missed in all Translation Lookaside = Buffer (TLB) levels and were mapped to 4K pages. Includes page walks that = page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x13", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks outstanding i= n the page miss handler (PMH) for stores every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = in the page miss handler (PMH) for stores every cycle. A PMH page walk is o= utstanding from page walk start till PMH becomes idle again (ready to serve= next walk). 
Includes EPT-walk intervals.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x10",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Number of page walks outstanding for a store in the PMH each cycle.",
+        "Counter": "0,1,2,3,4,5,6,7,8,9",
+        "EventCode": "0x13",
+        "EventName": "DTLB_STORE_MISSES.WALK_PENDING",
+        "PublicDescription": "Counts the number of page walks outstanding for a store in the PMH (Page Miss Handler) each cycle.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x10",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Counts the number of times there was an ITLB miss and a new translation was filled into the ITLB.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x81",
+        "EventName": "ITLB.FILLS",
+        "PublicDescription": "Counts the number of times the machine was unable to find a translation in the Instruction Translation Lookaside Buffer (ITLB) and a new translation was filled into the ITLB. The event is speculative in nature, but will not count translations (page walks) that are begun and not finished, or translations that are finished but not filled into the ITLB.",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of page walks initiated by an instruction fetch that missed the first and second level TLBs.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.MISS_CAUSED_WALK",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts walks that miss the PDE_CACHE",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0x85",
+        "EventName": "ITLB_MISSES.PDE_CACHE_MISS",
+        "SampleAfterValue": "2000003",
+        "UMask": "0x80",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of first level TLB misses but second level hits due to an instruction fetch that did not start a page walk. Accounts for all page sizes. 
Will result in an ITLB write from STLB.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.STLB_HIT", + "SampleAfterValue": "2000003", + "UMask": "0x20", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction fetch requests that miss the ITLB= and hit the STLB.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.STLB_HIT", + "PublicDescription": "Counts instruction fetch requests that miss = the ITLB (Instruction TLB) and hit the STLB (Second-level TLB).", + "SampleAfterValue": "100003", + "UMask": "0x120", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Cycles when at least one PMH is busy with a p= age walk for code (instruction fetch) request.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "CounterMask": "1", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_ACTIVE", + "PublicDescription": "Counts cycles when at least one PMH (Page Mi= ss Handler) is busy with a page walk for a code (instruction fetch) request= .", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, { "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to any page size.", "Counter": "0,1,2,3,4,5,6,7", @@ -58,5 +371,120 @@ "SampleAfterValue": "100003", "UMask": "0xe", "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 2M or 4M page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 2M or 4M pages. Includ= es page walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x4", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page wal= k that completes. (2M/4M)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED_2M_4M", + "PublicDescription": "Counts completed page walks (2M/4M page size= s) caused by a code fetch. This implies it missed in the ITLB (Instruction = TLB) and further levels of TLB. The page walk can end with or without a fau= lt.", + "SampleAfterValue": "100003", + "UMask": "0x4", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks completed due= to instruction fetch misses to a 4K page.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts the number of page walks completed du= e to instruction fetches whose address translations missed in all Translati= on Lookaside Buffer (TLB) levels and were mapped to 4K pages. Includes pag= e walks that page fault.", + "SampleAfterValue": "2000003", + "UMask": "0x2", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Code miss in all TLB levels causes a page wal= k that completes. (4K)", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_COMPLETED_4K", + "PublicDescription": "Counts completed page walks (4K page sizes) = caused by a code fetch. This implies it missed in the ITLB (Instruction TLB= ) and further levels of TLB. 
The page walk can end with or without a fault.= ", + "SampleAfterValue": "100003", + "UMask": "0x2", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of page walks outstanding f= or iside in PMH every cycle.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x85", + "EventName": "ITLB_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for iside in PMH every cycle. A PMH page walk is outstanding from page wal= k start till PMH becomes idle again (ready to serve next walk). Includes EP= T-walk intervals. Walks could be counted by edge detecting on this event, = but would count restarted suspended walks.", + "SampleAfterValue": "200003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of page walks outstanding for an outst= anding code request in the PMH each cycle.", + "Counter": "0,1,2,3,4,5,6,7,8,9", + "EventCode": "0x11", + "EventName": "ITLB_MISSES.WALK_PENDING", + "PublicDescription": "Counts the number of page walks outstanding = for an outstanding code (instruction fetch) request in the PMH (Page Miss H= andler) each cycle.", + "SampleAfterValue": "100003", + "UMask": "0x10", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of occurrences a load gets = blocked because of a micro TLB miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x03", + "EventName": "LD_BLOCKS.DTLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer is stalled due to a DTLB miss", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.DTLB_MISS", + "SampleAfterValue": "1000003", + "UMask": "0x10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the head (ol= dest load) of the load buffer and retirement are both stalled due to a DTLB= miss.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0x05", + "EventName": "LD_HEAD.DTLB_MISS_AT_RET", + "SampleAfterValue": "1000003", + "UMask": "0x90", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of PMH walks that hit in th= e L1 or WCBs", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xbc", + "EventName": "PAGE_WALKER_LOADS.DTLB_L1_HIT", + "SampleAfterValue": "1000003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of PMH walks that hit in th= e L2", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xbc", + "EventName": "PAGE_WALKER_LOADS.DTLB_L2_HIT", + "PublicDescription": "Counts the number of PMH walks that hit in t= he L2. 
Includes L2 Hit resulting from an L1D eviction of another core in the same module, which is longer latency than a typical L2 hit.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Count number of any STLB flush attempts (Entire, PCID, InvPage, CR3 write, etc)",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xbd",
+        "EventName": "TLB_FLUSHES.STLB_ANY",
+        "SampleAfterValue": "20003",
+        "UMask": "0x20",
+        "Unit": "cpu_atom"
+    }
 ]
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 83b8ca32fd1d..2cc2e477df0c 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -22,7 +22,7 @@ GenuineIntel-6-3A,v24,ivybridge,core
 GenuineIntel-6-3E,v24,ivytown,core
 GenuineIntel-6-2D,v24,jaketown,core
 GenuineIntel-6-(57|85),v16,knightslanding,core
-GenuineIntel-6-BD,v1.01,lunarlake,core
+GenuineIntel-6-BD,v1.10,lunarlake,core
 GenuineIntel-6-A[AC],v1.10,meteorlake,core
 GenuineIntel-6-1[AEF],v4,nehalemep,core
 GenuineIntel-6-2E,v4,nehalemex,core
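
A note on consuming the TOPDOWN.* slot events added earlier in this series: per their descriptions, TOPDOWN.SLOTS is the denominator and the *_SLOTS events are the numerators for the TMA level-1 categories. A minimal sketch of that arithmetic in Python (illustrative only, not part of the patch; it assumes the four raw counts have already been collected, e.g. with perf stat, and that the retiring count comes from the retiring-slots event defined alongside these):

  # TMA level-1 fractions from the cpu_core slot counts described above.
  def tma_level1(slots, backend_bound_slots, bad_spec_slots, retiring_slots):
      backend_bound = backend_bound_slots / slots   # TOPDOWN.BACKEND_BOUND_SLOTS
      bad_speculation = bad_spec_slots / slots      # TOPDOWN.BAD_SPEC_SLOTS
      retiring = retiring_slots / slots             # retiring-slots counterpart
      # By construction, the remaining slots are frontend bound.
      frontend_bound = 1.0 - (backend_bound + bad_speculation + retiring)
      return {"frontend_bound": frontend_bound, "backend_bound": backend_bound,
              "bad_speculation": bad_speculation, "retiring": retiring}

In practice perf derives these fractions automatically from the JSON metrics (something like perf stat -M TopdownL1); the sketch is only to make the event descriptions concrete.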
From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:49 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-18-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
X-Mailer: git-send-email 2.48.0.rc2.279.g1de40edade-goog
Subject: [PATCH v2 17/23] perf vendor events: Update Meteorlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Type: text/plain; charset="utf-8"

Update events from v1.10 to v1.12. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v1.12:
https://github.com/intel/perfmon/commit/d8fe70c91bf8f166ba08edd4d02fd7846a3fd956
https://github.com/intel/perfmon/commit/b9dabd05ff44af24fde0682e16d1a716c932f0d0

This updates the mapfile.csv for the 0xB5 CPUID variant of meteorlake.
https://github.com/intel/perfmon/commit/c3094bc9bbaff30071874a492afc3369554d572e

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 tools/perf/pmu-events/arch/x86/mapfile.csv    |    2 +-
 .../pmu-events/arch/x86/meteorlake/cache.json |  109 +-
 .../arch/x86/meteorlake/frontend.json         |   30 +-
 .../arch/x86/meteorlake/memory.json           |   22 +-
 .../arch/x86/meteorlake/metricgroups.json     |   10 +-
 .../arch/x86/meteorlake/mtl-metrics.json      | 3818 +++++++++++++----
 .../pmu-events/arch/x86/meteorlake/other.json |   54 +
 .../arch/x86/meteorlake/pipeline.json         |   89 +-
 8 files changed, 3265 insertions(+), 869 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 2cc2e477df0c..50304f1056e7 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -23,7 +23,7 @@ GenuineIntel-6-3E,v24,ivytown,core
 GenuineIntel-6-2D,v24,jaketown,core
 GenuineIntel-6-(57|85),v16,knightslanding,core
 GenuineIntel-6-BD,v1.10,lunarlake,core
-GenuineIntel-6-A[AC],v1.10,meteorlake,core
+GenuineIntel-6-(AA|AC|B5),v1.12,meteorlake,core
 GenuineIntel-6-1[AEF],v4,nehalemep,core
 GenuineIntel-6-2E,v4,nehalemex,core
 GenuineIntel-6-A7,v1.03,rocketlake,core
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
index 908e3c7f6d6e..ce351cd7caaf 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/cache.json
@@ -92,11 +92,11 @@
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0x26",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1",
         "Unit": "cpu_core"
@@ -399,7 +399,7 @@
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to an icache or itlb miss which hit in the LLC.",
+        "BriefDescription": "Counts the number of unhalted cycles when the core is stalled due to an ICACHE or ITLB miss which hit in the LLC. 
If the= core has access to an L3 cache, an LLC hit refers to an L3 cache hit, othe= rwise it counts zeros.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x35", "EventName": "MEM_BOUND_STALLS_IFETCH.LLC_HIT", @@ -408,7 +408,7 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an icache or itlb miss which missed all the caches.= ", + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to an ICACHE or ITLB miss which missed all the caches.= If the core has access to an L3 cache, an LLC miss refers to an L3 cache m= iss, otherwise it is an L2 cache miss.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x35", "EventName": "MEM_BOUND_STALLS_IFETCH.LLC_MISS", @@ -436,7 +436,7 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which hit in the LLC.", + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which hit in the LLC. If the cor= e has access to an L3 cache, an LLC hit refers to an L3 cache hit, otherwis= e it counts zeros.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x34", "EventName": "MEM_BOUND_STALLS_LOAD.LLC_HIT", @@ -445,7 +445,7 @@ "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which missed all the local cache= s.", + "BriefDescription": "Counts the number of unhalted cycles when the= core is stalled due to a demand load miss which missed all the local cache= s. If the core has access to an L3 cache, an LLC miss refers to an L3 cache= miss, otherwise it is an L2 cache miss.", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0x34", "EventName": "MEM_BOUND_STALLS_LOAD.LLC_MISS", @@ -459,7 +459,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts all retired load instructions. 
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81", @@ -471,7 +470,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82", @@ -483,7 +481,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83", @@ -495,7 +492,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21", @@ -507,7 +503,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41", @@ -519,7 +514,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42", @@ -531,7 +525,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions with a c= lean hit in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x9", @@ -543,7 +536,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_HIT_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that hi= t in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0xa", @@ -555,7 +547,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11", @@ -567,7 +558,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12", @@ -589,7 +579,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4", @@ -601,7 +590,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1", @@ -613,7 +601,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8", @@ -625,7 +612,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose 
data = sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2", @@ -637,7 +623,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1", @@ -649,7 +634,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4", @@ -661,7 +645,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40", @@ -673,7 +656,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1", @@ -685,7 +667,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8", @@ -697,7 +678,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2", @@ -709,7 +689,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10", @@ -721,7 +700,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4", @@ -733,7 +711,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20", @@ -1027,6 +1004,36 @@ "UMask": "0x42", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts the number of memory uops retired that= missed in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS", + "SampleAfterValue": "200003", + "UMask": "0x13", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of load uops retired that m= iss in the second Level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + "Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_LOADS", + "SampleAfterValue": "200003", + "UMask": "0x11", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of store uops retired that = miss in the second level TLB.", + "Counter": "0,1,2,3,4,5,6,7", + 
"Data_LA": "1", + "EventCode": "0xd0", + "EventName": "MEM_UOPS_RETIRED.STLB_MISS_STORES", + "SampleAfterValue": "200003", + "UMask": "0x12", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of stores uops retired sam= e as MEM_UOPS_RETIRED.ALL_STORES", "Counter": "0,1,2,3,4,5,6,7", @@ -1047,6 +1054,50 @@ "UMask": "0x3", "Unit": "cpu_core" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3F803C0004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache where a snoop w= as sent, the snoop hit, and modified data was forwarded.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HITM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x10003C0004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache where a snoop w= as sent, the snoop hit, but no data was forwarded.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_NO_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x4003C0004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were supplied by the L3 cache where a snoop w= as sent, the snoop hit, and non-modified data was forwarded.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_HIT.SNOOP_HIT_WITH_FWD", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x8003C0004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand data reads that were supplied b= y the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json b/tool= s/perf/pmu-events/arch/x86/meteorlake/frontend.json index b6c52f7385fc..a10614513c8d 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/frontend.json @@ -55,7 +55,6 @@ "EventName": "FRONTEND_RETIRED.ANY_ANT", "MSRIndex": "0x3F7", "MSRValue": "0x9", - "PEBS": "1", "PublicDescription": "Always Not Taken (ANT) conditional retired b= ranches (no BTB entry and not mispredicted)", "SampleAfterValue": "100007", "UMask": "0x3", @@ -68,7 +67,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -81,7 +79,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experien= ced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache= ) miss. 
Critical means stalls were exposed to the back-end as a result of t= he DSB miss.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -103,7 +100,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -116,7 +112,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -129,7 +124,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced = Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -142,7 +136,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x600106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -155,7 +148,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x608006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 12= 8 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -168,7 +160,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x601006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 16 cycles. During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -181,7 +172,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x600206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after = an interval where the front-end delivered no uops for a period of at least = 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -194,7 +184,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x610006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 25= 6 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -207,7 +196,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after the front-end had at least 1 bubble-slot for a per= iod of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there = was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -220,7 +208,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x602006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 32 cycles. 
During th= is period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -233,7 +220,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x600406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 4 = cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -246,7 +232,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x620006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 51= 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -259,7 +244,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x604006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched= after an interval where the front-end delivered no uops for a period of 64= cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -272,7 +256,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x600806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are deliver= ed to the back-end after a front-end stall of at least 8 cycles. During thi= s period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -285,7 +268,6 @@ "EventName": "FRONTEND_RETIRED.MISP_ANT", "MSRIndex": "0x3F7", "MSRValue": "0x9", - "PEBS": "1", "PublicDescription": "ANT retired branches that got just mispredic= ted", "SampleAfterValue": "100007", "UMask": "0x2", @@ -298,7 +280,6 @@ "EventName": "FRONTEND_RETIRED.MS_FLOWS", "MSRIndex": "0x3F7", "MSRValue": "0x8", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x3", "Unit": "cpu_core" @@ -310,7 +291,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced= STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x3", @@ -323,7 +303,6 @@ "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", "MSRIndex": "0x3F7", "MSRValue": "0x17", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x3", "Unit": "cpu_core" @@ -539,5 +518,14 @@ "SampleAfterValue": "1000003", "UMask": "0x1", "Unit": "cpu_core" + }, + { + "BriefDescription": "Counts the number of cycles that the micro-se= quencer is busy.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xe7", + "EventName": "MS_DECODED.MS_BUSY", + "SampleAfterValue": "1000003", + "UMask": "0x4", + "Unit": "cpu_atom" } ] diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/memory.json b/tools/= perf/pmu-events/arch/x86/meteorlake/memory.json index b464a8ab32ca..e4481fbc1e13 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/memory.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/memory.json @@ -143,7 +143,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 1024 cycles. 
Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1", @@ -157,7 +156,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 128 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1", @@ -171,7 +169,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 16 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1", @@ -185,7 +182,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_2048", "MSRIndex": "0x3F6", "MSRValue": "0x800", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 2048 cycles. Reporte= d latency may be longer than just the memory latency.", "SampleAfterValue": "23", "UMask": "0x1", @@ -199,7 +195,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 256 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1", @@ -213,7 +208,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 32 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1", @@ -227,7 +221,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 4 cycles. Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1", @@ -241,7 +234,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 512 cycles. Reported= latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1", @@ -255,7 +247,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 64 cycles. Reported = latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1", @@ -269,7 +260,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the late= ncy from first dispatch to completion is greater than 8 cycles. 
Reported l= atency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1", @@ -281,7 +271,6 @@ "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", - "PEBS": "2", "PublicDescription": "Counts Retired memory accesses with at least= 1 store operation. This PEBS event is the precisely-distributed (PDist) tr= igger covering all stores uops for sampling by the PEBS Store Latency Facil= ity. The facility is described in Intel SDM Volume 3 section 19.9.8", "SampleAfterValue": "1000003", "UMask": "0x2", @@ -305,6 +294,17 @@ "UMask": "0x4", "Unit": "cpu_atom" }, + { + "BriefDescription": "Counts demand instruction fetches and L1 inst= ruction cache prefetches that were not supplied by the L3 cache.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_CODE_RD.L3_MISS", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x3FBFC00004", + "SampleAfterValue": "100003", + "UMask": "0x1", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts demand data reads that were not suppli= ed by the L3 cache.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json b/= tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json index b54a5fc0861f..855585fe6fae 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/metricgroups.json @@ -41,6 +41,7 @@ "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Load_Store_Miss": "Grouping from Top-down Microarchitecture Analysis = Metrics spreadsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -89,7 +90,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -99,6 +102,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_ifetch_bandwidth_group": "Metrics contributing to tma_ifetch_band= width category", "tma_ifetch_latency_group": "Metrics contributing to tma_ifetch_latenc= y category", "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", @@ -121,10 +125,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the 
issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -138,5 +145,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json b/t= ools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json index 1a7392f0da86..0707f0fe6163 100644 --- a/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json @@ -84,54 +84,219 @@ "MetricName": "smi_num", "ScaleUnit": "1SMI#" }, + { + "BriefDescription": "Uncore frequency per die [GHZ]", + "MetricExpr": "tma_info_system_socket_clks / #num_dies / duration_= time / 1e9", + "MetricGroup": "SoC", + "MetricName": "UNCORE_FREQ", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", - "MetricExpr": "tma_core_bound", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", + "MetricThreshold": "(tma_allocation_restriction >0.10) & ((tma_cor= e_bound >0.10) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", + "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_alu_op_utilization", + "MetricThreshold": "tma_alu_op_utilization > 0.4", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops delivered by the Microcode_Sequencer as a result of Assists= ", + "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", + "MetricGroup": "BvIO;Slots_Estimated;TopdownL4;tma_L4_group;tma_mi= crocode_sequencer_group", + 
"MetricName": "tma_assists", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: ASSISTS.ANY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of slots the C= PU retired uops as a result of handing SSE to AVX* or AVX* to SSE transitio= n Assists", + "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", + "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_ass= ists_group", + "MetricName": "tma_avx_assists", + "MetricThreshold": "tma_avx_assists > 0.1", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend due to backend stalls", - "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", + "BriefDescription": "This category represents fraction of slots wh= ere no uops are being delivered due to a lack of required resources for acc= epting new uops in the Backend", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", + "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound. 
Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.ALL_P@ / (6 * cpu_= atom@CPU_CLK_UNHALTED.CORE@)", + "BriefDescription": "This category represents fraction of slots wa= sted due to incorrect speculations", + "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + t= ma_retiring), 0)", "MetricGroup": "TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%", "Unit": "cpu_atom" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Default;Fed;Frontend;IcMiss;Memo= ryTLB;Scaled_Slots;TopdownL1;tma_L1_group", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)= ))", + "MetricGroup": "BvMB;Default;Mem;MemoryBW;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_b= andwidth, tma_sq_full", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bou= nd + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + = tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_= latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_de= pendency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_= bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_d= ram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_st= ore_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_load= s + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_= l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_l= oads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma= _lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_s= tore_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sha= ring + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_mem= ory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_late= ncy + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtl= b_store)))", + "MetricGroup": "BvML;Default;Mem;MemoryLat;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": 
"tma_bottleneck_cache_memory_latency > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_= utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;Default;Scaled_Slots;TopdownL1;tma_L1_gro= up;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + t= ma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1@) * (t= ma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_res= teers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispr= edicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_bra= nches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_= ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / = (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Default;Fed;FetchBW;Frontend;Scaled_Slots;Top= downL1;tma_L1_group", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e= .g", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@U= OPS_RETIRED.MS\\,cmask\\=3D0x1@) * (tma_fetch_latency * (tma_ms_switches + = tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma= _other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + = tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itl= b_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switch= es) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms= )) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / t= ma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RES= OURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_s= 
erializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (= tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_m= icrocode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Default;Ret;Scaled_Slots;TopdownL1;tm= a_L1_group;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). Related metri= cs: tma_microcode_sequencer, tma_ms_switches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_depende= ncy + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound= * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dra= m_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_fa= lse_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Default;Mem;MemoryTLB;Offcore;Scaled_Slots;To= pdownL1;tma_L1_group;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;Default;LockCont;Mem;Offcore;Scaled_Slots;Top= downL1;tma_L1_group;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). 
Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;Default;Scaled_Slot= s;TopdownL1;tma_L1_group;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Default;Offcore;Scaled_Slots;TopdownL1;tm= a_L1_group", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. 
Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "DefaultMetricgroupName": "TopdownL1", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Default;Ret;Scaled_Slots;TopdownL1;tma_L1_gro= up", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20", + "MetricgroupNoGroup": "TopdownL1;Default", + "Unit": "cpu_atom" + }, { "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_DETECT@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_detect >0.05) & ((tma_ifetch_laten= cy >0.15) & ((tma_frontend_bound >0.20)))", "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to branch mispredicts", - "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MISPREDICT@ / (6 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Branch Misprediction", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound += topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * sl= ots", "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", + "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. 
Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", "ScaleUnit": "100%", "Unit": "cpu_atom" }, @@ -140,563 +305,2402 @@ "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.BRANCH_RESTEER@ / (6 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", + "MetricThreshold": "(tma_branch_resteer >0.05) & ((tma_ifetch_late= ncy >0.15) & ((tma_frontend_bound >0.20)))", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS).", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.CISC@ / (6 * cpu_atom@CPU= _CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", - "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers", + "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_overlap", + "MetricName": "tma_branch_resteers", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. 
Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of cycles due to backend bound stalls that are bounded by core restrictions and not attributed to an outstanding load or stores, or resource limitation",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
-        "MetricName": "tma_core_bound",
-        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c01_wait",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to decode stalls.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
-        "MetricName": "tma_decode",
-        "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_c02_wait",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that does not require the use of microcode, classified as a fast nuke, due to memory ordering, memory disambiguation and memory renaming",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
-        "MetricName": "tma_fast_nuke",
-        "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0.05 & tma_bad_speculation > 0.15)",
+        "BriefDescription": "This metric estimates fraction of cycles the CPU retired uops that originated from a CISC (complex instruction set computer) instruction",
+        "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
+        "MetricName": "tma_cisc",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops that originated from a CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. 
Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL1;tma_L1_group", - "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL1", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Machine Clears", + "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", + "MetricGroup": "BadSpec;Clocks;MachineClears;TopdownL4;tma_L4_grou= p;tma_branch_resteers_group;tma_issueMC", + "MetricName": "tma_clears_resteers", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ICACHE@ / (6 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", - "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEN= D_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRE= D.L2_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;IcMiss;Offcore;TopdownL4;t= ma_L4_group;tma_icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of issue slots 
that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", - "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (6 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", - "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", - "MetricgroupNoGroup": "TopdownL2", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTE= ND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipflop", + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETI= RED.STLB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;FetchLat;MemoryTLB;TopdownL4;tma_L4= _group;tma_itlb_misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISS= ES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * IT= LB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + 
ITLB_MISSES.= WALK_COMPLETED_2M_4M)", + "MetricGroup": "Clocks_Estimated;FetchLat;MemoryTLB;TopdownL5;tma_= L5_group;tma_code_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", - "MetricGroup": "Flops", - "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by non-taken conditional bran= ches", + "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP= _RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_nt_mispredicts", + "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", - "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_ato= m@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to misprediction by taken conditional branches", + "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_COST * cpu_core@BR_MISP_= RETIRED.COND_TAKEN_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_cond_tk_mispredicts", + "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mi= spredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_a= tom@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "Ifetch", - "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. 
See Info.Ifetch_Bound",
+        "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricName": "tma_contested_accesses",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of time that retirement is stalled due to an L1 miss",
-        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
-        "MetricGroup": "Load_Store_Miss",
-        "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles",
-        "PublicDescription": "Percentage of time that retirement is stalled due to an L1 miss. See Info.Load_Miss_Bound",
+        "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck",
+        "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_core_bound",
+        "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", - "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_C= LK_UNHALTED.CORE@", - "MetricGroup": "Mem_Exec", - "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound", + "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", + "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_cor= e@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FW= D * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_freque= ncy) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3= _HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_= info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_c= ore@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * = (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency)= if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RE= TIRED.XSNP_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_syste= m_core_frequency) * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND= _DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))= ) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info= _thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;Offcore;Snoop;TopdownL4;tma_= L4_group;tma_issueSyncxn;tma_l3_bound_group", + "MetricName": "tma_data_sharing", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. 
Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", - "MetricName": "tma_info_br_inst_mix_ipbranch", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.DECODE@ / (6 * cpu_atom@C= PU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_decode", + "MetricThreshold": "(tma_decode >0.05) & ((tma_ifetch_bandwidth >0= .10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instruction per (near) call (lower number mea= ns higher occurrence rate)", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.NEAR_CALL@", - "MetricName": "tma_info_br_inst_mix_ipcall", + "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", + "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - = cpu_core@INST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks = / 2", + "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL4;tma_L4_g= roup;tma_issueD0;tma_mite_group", + "MetricName": "tma_decoder0_alone", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", - "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "BriefDescription": "This metric represents fraction of cycles whe= re the Divider unit was active", + "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;TopdownL3;tma_L3_group;tma_core_bound_= group", + "MetricName": "tma_divider", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. 
Sample w= ith: ARITH.DIV_ACTIVE", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "BriefDescription": "This metric estimates how often the CPU was s= talled on accesses to external memory (DRAM) by loads", + "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_cl= ks", + "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", + "MetricName": "tma_dram_bound", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", - "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to DSB (decoded uop cache) fetch pipe= line", + "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info= _core_core_clks / 2", + "MetricGroup": "DSB;FetchBW;Slots_Estimated;TopdownL3;tma_L3_group= ;tma_fetch_bandwidth_group", + "MetricName": "tma_dsb", + "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", - "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to switches from DSB to MITE pipelines", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", + "MetricGroup": "Clocks;DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma= _fetch_latency_group;tma_issueFB", + "MetricName": "tma_dsb_switches", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. 
Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired return Branch Mispre= diction", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", - "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM= _INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 <= cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_= LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_l1_bound_group", + "MetricName": "tma_dtlb_load", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per retired Branch Misprediction= ", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", - "MetricName": "tma_info_br_inst_mix_ipmispredict", + "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", + "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@ME= M_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if = 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_= HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss", + "MetricGroup": "BvMT;Clocks_Estimated;MemoryTLB;TopdownL4;tma_L4_g= roup;tma_issueTLB;tma_store_bound_group", + "MetricName": "tma_dtlb_store", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio of all branches which mispredict", - "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", + "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", + "MetricGroup": "BvMS;Clocks_Estimated;DataSharing;LockCont;Offcore= ;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group", + "MetricName": "tma_false_sharing", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", - "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", - "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.FASTNUKE@ / (6 * c= pu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_fast_nuke", + "MetricThreshold": "(tma_fast_nuke >0.05) & ((tma_machine_clears >= 0.05) & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;Clocks_Calculated;MemoryBW;TopdownL4;tma_L4_g= roup;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricName": "tma_fb_full", + "MetricThreshold": "tma_fb_full > 0.3", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", - "Unit": "cpu_atom" - }, + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend bandwidth issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)", + "MetricGroup": "Default;FetchBW;Frontend;Slots;TmaL2;TopdownL2;tma= _L2_group;tma_frontend_bound_group;tma_issueFB", + "MetricName": "tma_fetch_bandwidth", + "MetricThreshold": "tma_fetch_bandwidth > 0.2", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, { - "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", + "MetricGroup": "Default;Frontend;Slots;TmaL2;TopdownL2;tma_L2_grou= p;tma_frontend_bound_group", + "MetricName": "tma_fetch_latency", + "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
+        "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
+        "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
+        "MetricName": "tma_few_uops_instructions",
+        "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
+        "BriefDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired)",
+        "MetricExpr": "tma_x87_use + tma_fp_scalar + tma_fp_vector",
+        "MetricGroup": "HPC;TopdownL3;Uops;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fp_arith",
+        "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists",
+        "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots",
+        "MetricGroup": "HPC;Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_fp_assists",
+        "MetricThreshold": "tma_fp_assists > 0.1",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. 
FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_scalar", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", + "MetricGroup": "Compute;Flops;TopdownL4;Uops;tma_L4_group;tma_fp_a= rith_group;tma_issue2P", + "MetricName": "tma_fp_vector", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", + "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_v= ector_group;tma_issue2P", + "MetricName": "tma_fp_vector_128b", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors",
+        "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;Flops;TopdownL5;Uops;tma_L5_group;tma_fp_vector_group;tma_issue2P",
+        "MetricName": "tma_fp_vector_256b",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
+        "MetricGroup": "TopdownL1;tma_L1_group",
+        "MetricName": "tma_frontend_bound",
+        "MetricThreshold": "tma_frontend_bound > 0.15",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible for fetching operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions",
+        "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_fused_instructions",
+        "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions, where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. 
{([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_heavy_operations", + "MetricThreshold": "tma_heavy_operations > 0.1", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricName": "tma_icache_misses", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH@ / (6 = * cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_bandwidth", + "MetricThreshold": "(tma_ifetch_bandwidth >0.10) & ((tma_frontend_= bound >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.FRONTEND_LATENCY@ / (6 * = cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricName": "tma_ifetch_latency", + "MetricThreshold": "(tma_ifetch_latency >0.15) & ((tma_frontend_bo= und >0.20))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + 
"ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_FLOPS_RETI= RED.ALL@", "MetricGroup": "Flops", - "MetricName": "tma_info_core_flopc", + "MetricName": "tma_info_arith_inst_mix_ipflop", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions Per Cycle", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", - "MetricName": "tma_info_core_ipc", + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@FP_INST_RETI= RED.128B_DP@ + cpu_atom@FP_INST_RETIRED.128B_SP@)", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_avx128", "Unit": "cpu_atom" }, { - "BriefDescription": "Uops Per Instruction", - "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL_P@ / cpu_atom@INST_RE= TIRED.ANY@", - "MetricName": "tma_info_core_upi", + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.64B_DP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_dp", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@FP_INST_RETIR= ED.32B_SP@", + "MetricGroup": "Flops", + "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricGroup": "Bad;BrMispredicts;Core_Metric;tma_issueBM", + "MetricName": "tma_info_bad_spec_branch_misprediction_cost", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_MISS@ / = cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS_LOAD.ALL@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", + "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / cpu= _atom@MEM_BOUND_STALLS_LOAD.ALL@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit= ", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_indirect", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses the L3", - "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_MISS@ / cp= u_atom@MEM_BOUND_STALLS_LOAD.ALL@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", + "MetricGroup": "Bad;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmisp_ret", + "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;BadSpec;BrMispredicts;Inst_Metric", + "MetricName": "tma_info_bad_spec_ipmispredict", + "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + 
"MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", + "MetricGroup": "BrMispredicts;Metric", + "MetricName": "tma_info_bad_spec_spec_clears_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", + "MetricExpr": "(100 * (1 - tma_core_bound / tma_ports_utilization = if tma_core_bound < tma_ports_utilization else 1) if tma_info_system_smt_2t= _utilization > 0.5 else 0)", + "MetricGroup": "Cor;Metric;SMT", + "MetricName": "tma_info_botlnk_l0_core_bound_likely", + "MetricThreshold": "tma_info_botlnk_l0_core_bound_likely > 0.5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", + "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", + "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB", + "MetricName": "tma_info_botlnk_l2_dsb_misses", + "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", + "PublicDescription": "Total pipeline cost of DSB (uop cache) misse= s - subset of the Instruction_Fetch_BW Bottleneck. 
+    {
+        "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_lsd + tma_ms))",
+        "MetricGroup": "DSBmiss;Fed;Scaled_Slots;tma_issueFB",
+        "MetricName": "tma_info_botlnk_l2_dsb_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
+        "PublicDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL",
+        "MetricName": "tma_info_botlnk_l2_ic_misses",
+        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that retirement is stalled due to a first level data TLB miss",
+        "MetricExpr": "100 * (cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ + cpu_atom@LD_HEAD.PGWALK_AT_RET@) / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Ifetch",
+        "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles",
+        "PublicDescription": "Percentage of time that allocation and retirement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icache or ITLB Miss. See Info.Ifetch_Bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that retirement is stalled due to an L1 miss",
+        "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.ALL@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Load_Store_Miss",
+        "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles",
+        "PublicDescription": "Percentage of time that retirement is stalled due to an L1 miss. See Info.Load_Miss_Bound",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall",
+        "MetricExpr": "100 * cpu_atom@LD_HEAD.ANY_AT_RET@ / cpu_atom@CPU_CLK_UNHALTED.CORE@",
+        "MetricGroup": "Mem_Exec",
+        "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles",
+        "PublicDescription": "Percentage of time that retirement is stalled by the Memory Cluster due to a pipeline stall. See Info.Mem_Exec_Bound",
+        "Unit": "cpu_atom"
+    },
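The Atom-side *_bound_cycles entries above are plain cycle ratios: a
stall counter divided by CPU_CLK_UNHALTED.CORE, times 100. A sketch with
made-up counts; on a live system the same number can be read directly
with something like 'perf stat -M tma_info_bottleneck_%_load_miss_bound_cycles',
assuming a perf build that carries these vendor files:

  # Made-up counter values; real ones come from perf on the atom CPUs.
  core_clks = 10_000_000        # CPU_CLK_UNHALTED.CORE
  load_miss_stalls = 1_250_000  # MEM_BOUND_STALLS_LOAD.ALL

  # MetricExpr shape: 100 * MEM_BOUND_STALLS_LOAD.ALL / CPU_CLK_UNHALTED.CORE
  load_miss_bound = 100 * load_miss_stalls / core_clks
  print(f"load_miss_bound_cycles = {load_miss_bound:.1f}%")  # 12.5%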
See Info.Mem_Exec_Bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction per (near) call (lower number mea= ns higher occurrence rate)", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricName": "tma_info_br_inst_mix_ipcall", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricName": "tma_info_br_inst_mix_ipfarbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was not taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / (cpu_atom@BR_MISP_RETI= RED.COND@ - cpu_atom@BR_MISP_RETIRED.COND_TAKEN@)", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_ntaken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired conditional Branch M= isprediction where the branch was taken", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.COND_TAKEN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_cond_taken", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired indirect call or jum= p Branch Misprediction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.INDIRECT@", + "MetricName": "tma_info_br_inst_mix_ipmisp_indirect", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired return Branch Mispre= diction", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.RETURN@", + "MetricName": "tma_info_br_inst_mix_ipmisp_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per retired Branch Misprediction= ", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricName": "tma_info_br_inst_mix_ipmispredict", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of all branches which mispredict", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= R_INST_RETIRED.ALL_BRANCHES@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_rati= o", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio between Mispredicted branches and unkno= wn branches", + "MetricExpr": "cpu_atom@BR_MISP_RETIRED.ALL_BRANCHES@ / cpu_atom@B= ACLEARS.ANY@", + "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are CALL or RET", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_callret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are non-taken condi= tionals", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_nt", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are taken condition= als", + 
"MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", + "MetricGroup": "Bad;Branches;CodeGen;Fraction;PGO", + "MetricName": "tma_info_branches_cond_tk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_jump", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of branches of other types (not indi= vidually covered by other metrics in Info.Branches group)", + "MetricExpr": "1 - (tma_info_branches_cond_nt + tma_info_branches_= cond_tk + tma_info_branches_callret + tma_info_branches_jump)", + "MetricGroup": "Bad;Branches;Fraction", + "MetricName": "tma_info_branches_other_branches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.LD_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.RSV@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", + "MetricExpr": "100 * cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_at= om@CPU_CLK_UNHALTED.CORE@", + "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", + "MetricGroup": "Count;SMT", + "MetricName": "tma_info_core_core_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", + "MetricGroup": "Core_Metric;Ret;SMT;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_core_coreipc", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction", + "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@INST_RET= IRED.ANY@", + "MetricName": "tma_info_core_cpi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "uops Executed per Cycle", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", + "MetricGroup": "Metric;Power", + "MetricName": "tma_info_core_epc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Floating Point Operations Per Cycle", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", + "MetricGroup": "Flops", + "MetricName": "tma_info_core_flopc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", + "MetricGroup": 
"Cor;Core_Metric;Flops;HPC", + "MetricName": "tma_info_core_fp_arith_utilization", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREA= D\\,cmask\\=3D0x1@", + "MetricGroup": "Backend;Cor;Metric;Pipeline;PortsUtil", + "MetricName": "tma_info_core_ilp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@CPU_CLK_UNHAL= TED.CORE@", + "MetricName": "tma_info_core_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL_P@ / cpu_atom@INST_RE= TIRED.ANY@", + "MetricName": "tma_info_core_upi", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "DSB;Fed;FetchBW;Metric;tma_issueFB", + "MetricName": "tma_info_frontend_dsb_coverage", + "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", + "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MIT= E_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "DSBmiss;Metric", + "MetricName": "tma_info_frontend_dsb_switch_cost", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\= \=3D0x1@", + "MetricGroup": "Fed;FetchBW;Metric", + "MetricName": "tma_info_frontend_fetch_upc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L1 instruction cache miss= es", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", + "MetricGroup": "Fed;FetchLat;IcMiss;Metric", + "MetricName": "tma_info_frontend_icache_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", + "MetricGroup": "DSBmiss;Fed;Inst_Metric", + "MetricName": "tma_info_frontend_ipdsb_miss_ret", + "MetricThreshold": 
"tma_info_frontend_ipdsb_miss_ret < 50", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_ipunknown_branch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", + "MetricGroup": "IcMiss;Metric", + "MetricName": "tma_info_frontend_l2mpki_code_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", + "MetricGroup": "Fed;LSD;Metric", + "MetricName": "tma_info_frontend_lsd_coverage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW;Metric", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", + "MetricGroup": "Fed;Metric", + "MetricName": "tma_info_frontend_unknown_branch_cost", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. 
See Unknown_Branches node", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.L2_HIT@ / cp= u_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss doesn't hit in the L2", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ + = cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_MISS@) / cpu_atom@MEM_BOUND_STALLS_IFE= TCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_HIT@ / c= pu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_IFETCH.LLC_MISS@ / = cpu_atom@MEM_BOUND_STALLS_IFETCH.ALL@", + "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", + "MetricGroup": "Branches;Fed;Metric;PGO", + "MetricName": "tma_info_inst_mix_bptkbranch", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total number of retired Instructions", + "MetricExpr": "INST_RETIRED.ANY", + "MetricGroup": "Count;Summary;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_inst_mix_instructions", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith", + "MetricThreshold": "tma_info_inst_mix_iparith < 10", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx128", + "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_avx256", + "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_dp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", + "MetricGroup": "Flops;FpScalar;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_iparith_scalar_sp", + "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", + "MetricGroup": "Branches;Fed;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipbranch", + "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", + "MetricGroup": "Branches;Fed;Inst_Metric;PGO", + "MetricName": "tma_info_inst_mix_ipcall", + "MetricThreshold": "tma_info_inst_mix_ipcall < 200", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", + "MetricGroup": "Flops;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipflop", + "MetricThreshold": "tma_info_inst_mix_ipflop < 10", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipload", + "MetricThreshold": "tma_info_inst_mix_ipload < 3", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", + "MetricGroup": "Flops;FpVector;InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ippause", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", + "MetricGroup": "InsType;Inst_Metric", + "MetricName": "tma_info_inst_mix_ipstore", + "MetricThreshold": "tma_info_inst_mix_ipstore < 8", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", + "MetricGroup": "Inst_Metric;Prefetches", + "MetricName": "tma_info_inst_mix_ipswpf", + "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per taken branch", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Frontend;Inst_Metric;PGO;tma_= issueFB", + "MetricName": "tma_info_inst_mix_iptb", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", + "PublicDescription": "Instructions per taken branch. 
Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.L2_HIT@ / cpu_= atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses in the L2", + "MetricExpr": "100 * (cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ + cp= u_atom@MEM_BOUND_STALLS_LOAD.LLC_MISS@) / cpu_atom@MEM_BOUND_STALLS_LOAD.AL= L@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_HIT@ / cpu= _atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that subsequently misses the L3", + "MetricExpr": "100 * cpu_atom@MEM_BOUND_STALLS_LOAD.LLC_MISS@ / cp= u_atom@MEM_BOUND_STALLS_LOAD.ALL@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_l1_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", + "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_load_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", + "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", + "MetricGroup": "load_store_bound", + "MetricName": "tma_info_load_store_bound_store_bound", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to page faults", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_ato= m@INST_RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fa= ult_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of 
machine clears relative = to thousands of instructions retired, due to self-modifying code", + "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_= RETIRED.ANY@", + "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@= MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", + "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@M= EM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", + "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_= HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", + "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD= _HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", + "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom= @LD_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", + "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", + "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Load", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_ipload", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Store", + "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", + "MetricName": "tma_info_mem_mix_ipstore", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_locks_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", + "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", + "MetricName": "tma_info_mem_mix_load_splits_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Ratio of mem load uops to all uops", + "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@TOPDOWN_RETIRING.ALL_P@", + "MetricName": "tma_info_mem_mix_memload_ratio", + 
"Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", + "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l1d_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l2_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l2_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_access_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore", + "MetricName": "tma_info_memory_core_l3_cache_access_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", + "MetricExpr": "tma_info_memory_l3_cache_fill_bw", + "MetricGroup": "Core_Metric;Mem;MemoryBW", + "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_fb_hpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l1d_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l1mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l2_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2hpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", + "MetricGroup": "Backend;CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki", + "Unit": "cpu_atom" + }, + { + 
"BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheHits;Mem;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_all", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", + "MetricGroup": "CacheHits;Mem;Metric", + "MetricName": "tma_info_memory_l2mpki_load", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", + "MetricGroup": "CacheMisses;Metric;Offcore", + "MetricName": "tma_info_memory_l2mpki_rfo", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", + "MetricGroup": "Mem;MemoryBW;Metric;Offcore", + "MetricName": "tma_info_memory_l3_cache_access_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", + "MetricGroup": "Mem;MemoryBW;Metric", + "MetricName": "tma_info_memory_l3_cache_fill_bw", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_l3mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss data reads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_data_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L2 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;LockCont;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Parallel L2 cache miss demand Loads", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_c= ore@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", + "MetricGroup": "Memory_BW;Metric;Offcore", + "MetricName": "tma_info_memory_latency_load_l2_mlp", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Latency for L3 cache miss demand Load= s", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", + "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", + "MetricName": "tma_info_memory_latency_load_l3_miss_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", + "MetricGroup": "Clocks_Latency;Mem;MemoryBound;MemoryLat", + "MetricName": "tma_info_memory_load_miss_real_latency", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "\"Bus lock\" per kilo instruction", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", + "MetricGroup": 
"Mem;Metric", + "MetricName": "tma_info_memory_mix_bus_lock_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Un-cacheable retired load per kilo instructio= n", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", + "MetricGroup": "Mem;Metric", + "MetricName": "tma_info_memory_mix_uc_load_pki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", + "MetricGroup": "Mem;MemoryBW;MemoryBound;Metric", + "MetricName": "tma_info_memory_mlp", + "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Metric;Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", + "MetricGroup": "Fed;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_code_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", + "MetricGroup": "Mem;MemoryTLB;Metric", + "MetricName": "tma_info_memory_tlb_load_stlb_mpki", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)", + "MetricGroup": "Core_Metric;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_page_walks_utilization", + "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Clocks_Retired;Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", + "MetricExpr": "1e3 * 
DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY",
+        "MetricGroup": "Mem;MemoryTLB;Metric",
+        "MetricName": "tma_info_memory_tlb_store_stlb_mpki",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "",
+        "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_GE_1 / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=0x1@)",
+        "MetricGroup": "Cor;Metric;Pipeline;PortsUtil;SMT",
+        "MetricName": "tma_info_pipeline_execute",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from DSB per cycle",
+        "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_dsb",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from LSD per cycle",
+        "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_lsd",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of uops fetched from MITE per cycle",
+        "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY",
+        "MetricGroup": "Fed;FetchBW;Metric",
+        "MetricName": "tma_info_pipeline_fetch_mite",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Instructions per a microcode Assist invocation",
+        "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY",
+        "MetricGroup": "Inst_Metric;MicroSeq;Pipeline;Ret;Retire",
+        "MetricName": "tma_info_pipeline_ipassist",
+        "MetricThreshold": "tma_info_pipeline_ipassist < 100000",
+        "PublicDescription": "Instructions per a microcode Assist invocation. See Assists tree node for details (lower number means higher occurrence rate)",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Average number of Uops retired in cycles where at least one uop has retired",
+        "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_retire",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Estimated fraction of retirement-cycles dealing with repeat instructions",
+        "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.SLOTS\\,cmask\\=0x1@",
+        "MetricGroup": "Metric;MicroSeq;Pipeline;Ret",
+        "MetricName": "tma_info_pipeline_strings_cycles",
+        "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Percentage of time that the core is stalled due to a TPAUSE or UMWAIT instruction",
+        "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricName": "tma_info_serialization_%_tpause_cycles",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Fraction of cycles the processor is waiting yet unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 power-performance optimized states",
+        "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks",
+        "MetricGroup": "C0Wait;Metric",
+        "MetricName": "tma_info_system_c0_wait",
+        "MetricThreshold": "tma_info_system_c0_wait > 0.05",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Measured Average Core Frequency for unhalted processors [GHz]",
+        "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma_info_system_time",
+        "MetricGroup": "Power;Summary;System_Metric",
+        "MetricName": "tma_info_system_core_frequency",
+        "Unit": "cpu_atom"
+    },
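The system utilization entries that follow chain together:
tma_info_system_cpus_utilized averages busy CPUs as REF_TSC over TSC,
and tma_info_system_cpu_utilization then normalizes by #num_cpus_online
(a literal supplied by perf's metric evaluator). A sketch under assumed
readings:

  # Assumed readings; REF_TSC and TSC would come from perf, the online
  # CPU count from the machine under test.
  ref_tsc = 24e9          # CPU_CLK_UNHALTED.REF_TSC summed over CPUs
  tsc = 16e9              # invariant TSC over the measured interval
  num_cpus_online = 8

  cpus_utilized = ref_tsc / tsc                      # average busy CPUs
  cpu_utilization = cpus_utilized / num_cpus_online  # fraction of machine
  print(f"{cpus_utilized:.2f} CPUs busy, {cpu_utilization:.0%} utilized")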
"MetricGroup": "HPC;Metric;Summary", + "MetricName": "tma_info_system_cpu_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of utilized CPUs", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", + "MetricGroup": "Metric;Summary", + "MetricName": "tma_info_system_cpus_utilized", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Giga Floating Point Operations Per Second", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", + "MetricGroup": "Flops", + "MetricName": "tma_info_system_gflops", + "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", + "MetricGroup": "Branches;Inst_Metric;OS", + "MetricName": "tma_info_system_ipfarbranch", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", + "MetricGroup": "Metric;OS", + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_kernel_utilization", + "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average number of parallel data read requests= to external memory", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D0x1@", + "MetricGroup": "Mem;MemoryBW;SoC;System_Metric", + "MetricName": "tma_info_system_mem_parallel_reads", + "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Clocks;Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC;System_Metric", + "MetricName": "tma_info_system_power", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", + "MetricGroup": "Core_Metric;SMT", + "MetricName": "tma_info_system_smt_2t_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Socket actual clocks when any core is active = on that socket", + "MetricExpr": "UNC_CLOCK.SOCKET", + "MetricGroup": "Count;SoC", + "MetricName": "tma_info_system_socket_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Seconds;Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", + "MetricGroup": "Power", + "MetricName": "tma_info_system_turbo_utilization", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", + "MetricGroup": "Count;Pipeline", + "MetricName": "tma_info_thread_clks", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", + "MetricExpr": "1 / tma_info_thread_ipc", + "MetricGroup": "Mem;Metric;Pipeline", + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "The ratio of Executed- by Issued-Uops", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", + "MetricGroup": "Cor;Metric;Pipeline", + "MetricName": "tma_info_thread_execute_per_issue", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", + "MetricGroup": "Metric;Ret;Summary", + "MetricName": "tma_info_thread_ipc", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", + "MetricExpr": "slots", + "MetricGroup": "Count;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_thread_slots", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", + "MetricGroup": "Metric;SMT;TmaL1;TopdownL1;tma_L1_group", + "MetricName": "tma_info_thread_slots_utilization", + "MetricgroupNoGroup": "TopdownL1", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops Per Instruction", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", + "MetricGroup": "Metric;Pipeline;Ret;Retire", + "MetricName": "tma_info_thread_uoppi", + "MetricThreshold": "tma_info_thread_uoppi > 1.05", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Uops per taken branch", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", + "MetricGroup": "Branches;Fed;FetchBW;Metric", + "MetricName": "tma_info_thread_uptb", + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are FPDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDO= WN_RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are IDiv uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOW= N_RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_idiv_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are microcode op= s", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_= RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_microcode_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Percentage of all uops which are x87 uops", + "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN= _RETIRING.ALL_P@", + "MetricName": "tma_info_uop_mix_x87_uop_ratio", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", + "MetricExpr": "tma_int_vector_128b + tma_int_vector_256b", + "MetricGroup": "Pipeline;TopdownL3;Uops;tma_L3_group;tma_light_ope= rations_group", + "MetricName": "tma_int_operations", + "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). 
Vector/Matrix Int operations and shuffles are counted. Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_128b",
+        "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 128-bit vector Integer ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired",
+        "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;Uops;tma_L4_group;tma_int_operations_group;tma_issue2P",
+        "MetricName": "tma_int_vector_256b",
+        "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations > 0.1 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric represents 256-bit vector Integer ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses",
+        "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
+        "MetricName": "tma_itlb_misses",
+        "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTEND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
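tma_itlb_misses just above is a good example of how the TMA tree
thresholds compose: the node itself is a simple cycles fraction, but it
is only flagged when its ancestors (tma_fetch_latency,
tma_frontend_bound) are significant too. A sketch with invented values:

  # Invented counts; ICACHE_TAG.STALLS and thread clocks come from perf.
  icache_tag_stalls = 550_000
  thread_clks = 10_000_000

  itlb_misses = icache_tag_stalls / thread_clks  # fraction of cycles
  fetch_latency, frontend_bound = 0.12, 0.20     # assumed parent nodes
  flagged = (itlb_misses > 0.05 and fetch_latency > 0.1
             and frontend_bound > 0.15)
  print(f"tma_itlb_misses = {itlb_misses:.3f}, flagged = {flagged}")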
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache",
+        "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.STALLS_L1D_MISS) / tma_info_thread_clks, 0)",
+        "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_issueL1;tma_issueMC;tma_memory_bound_group",
+        "MetricName": "tma_l1_bound",
+        "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled without loads missing the L1 Data (L1D) cache. The L1D cache typically has the shortest latency. However; in certain cases like loads blocked on older stores; a load might suffer due to high latency even though it is being satisfied by the L1D. Another example is loads that miss in the TLB. These cases are characterized by execution unit stalls; while some non-completed demand load lives in the machine without having that demand load missing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache",
+        "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETIRED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_l1_latency_dependency",
+        "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates fraction of cycles with demand load accesses that hit the L1D cache. The short latency of the L1D cache may be exposed in pointer-chasing memory access patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads",
+        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks",
+        "MetricGroup": "BvML;CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_l2_bound",
+        "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 misses/L2 hits) can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L2_HIT",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited)",
+        "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RETIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Retired;MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_group",
+        "MetricName": "tma_l2_hit_latency",
+        "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles with demand load accesses that hit the L2 cache under unloaded scenarios (possibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hits) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", + "MetricGroup": "CacheHits;MemoryBound;Stalls;TmaL3mem;TopdownL3;tm= a_L3_group;tma_memory_bound_group", + "MetricName": "tma_l3_bound", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "BvML;Clocks_Estimated;MemoryLat;TopdownL4;tma_L4_g= roup;tma_issueLat;tma_l3_bound_group;tma_overlap", + "MetricName": "tma_l3_hit_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", + "MetricGroup": "Clocks;FetchLat;TopdownL3;tma_L3_group;tma_fetch_l= atency_group;tma_issueFB", + "MetricName": "tma_lcp", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. 
Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", + "DefaultMetricgroupName": "TopdownL2", + "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", + "MetricGroup": "Default;Retire;Slots;TmaL2;TopdownL2;tma_L2_group;= tma_retiring_group", + "MetricName": "tma_light_operations", + "MetricThreshold": "tma_light_operations > 0.6", + "MetricgroupNoGroup": "TopdownL2;Default", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_co= re_clks)", + "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_ut= ilized_3m_group", + "MetricName": "tma_load_op_utilization", + "MetricThreshold": "tma_load_op_utilization > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port for Load operations. 
Sample with: UOPS_DISPATCHED.PORT_2_3_10",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) DTLB was missed by load accesses, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_hit",
+        "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by load accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_group",
+        "MetricName": "tma_load_stlb_miss",
+        "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_1g",
+        "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_2m",
+        "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data load accesses",
+        "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETED_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_miss_group",
+        "MetricName": "tma_load_stlb_miss_4k",
+        "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
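The _1g/_2m/_4k children above simply apportion tma_load_stlb_miss by the mix of completed page walks; a sketch with hypothetical values (not part of the patch):

    # Hypothetical DTLB_LOAD_MISSES.WALK_COMPLETED_* counts per page size
    walks = {"4k": 8.0e5, "2m": 1.5e5, "1g": 5.0e3}
    tma_load_stlb_miss = 0.06   # assumed fraction of cycles in hardware page walks
    total_walks = sum(walks.values())
    for size, n in walks.items():
        print(f"tma_load_stlb_miss_{size} = {tma_load_stlb_miss * n / total_walks:.3f}")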
to lock operations",
+        "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks",
+        "MetricGroup": "Clocks;LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1_bound_group",
+        "MetricName": "tma_lock_latency",
+        "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU spent handling cache misses due to lock operations. Due to the microarchitecture handling of locks; they are classified as L1_Bound regardless of what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOADS. Related metrics: tma_store_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit",
+        "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core_core_clks / 2",
+        "MetricGroup": "FetchBW;LSD;Slots_Estimated;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
+        "MetricName": "tma_lsd",
+        "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to LSD (Loop Stream Detector) unit. LSD typically does well sustaining Uop supply. However; in some rare cases; optimal uop-delivery could not be reached for small loops whose size (in terms of number of uops) does not suit well the LSD structure",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+        "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
+        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
+        "MetricName": "tma_machine_clears",
+        "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears. These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
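tma_machine_clears above is defined as the residual of the bad-speculation budget after branch mispredicts; a sketch with hypothetical level-2 fractions (not part of the patch):

    # Hypothetical topdown fractions
    tma_bad_speculation = 0.3
    tma_branch_mispredicts = 0.12
    tma_machine_clears = max(0.0, tma_bad_speculation - tma_branch_mispredicts)
    # Gate on the MetricThreshold from the entry above
    flagged = tma_machine_clears > 0.1 and tma_bad_speculation > 0.15
    print(f"tma_machine_clears = {tma_machine_clears:.0%}, flagged = {flagged}")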
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=0x4@) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueBW",
+        "MetricName": "tma_mem_bandwidth",
+        "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the core's performance was likely hurt due to approaching bandwidth limits of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuristic assumes that a similar off-core traffic is generated by all IA cores. This metric does not aggregate non-data-read requests by this logical processor; requests from other IA Logical Processors/Physical Cores/sockets; or other non-IA devices like GPU; hence the maximum external memory bandwidth limits may or may not be approached when this metric is flagged (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_sq_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM)",
+        "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth",
+        "MetricGroup": "BvML;Clocks;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_dram_bound_group;tma_issueLat",
+        "MetricName": "tma_mem_latency",
+        "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles where the performance was likely hurt due to latency from external memory - DRAM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from other Logical Processors/Physical Cores/sockets (see Uncore counters for that). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
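The two DRAM metrics above split data-read occupancy cycles: cycles with 4 or more reads in flight (the cmask) count as bandwidth-bound, and the remaining with-data-read cycles are attributed to latency; a sketch with hypothetical counts (not part of the patch):

    # Hypothetical cycle counts
    clks = 1.0e9
    cycles_ge1 = 3.0e8   # OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD (assumed)
    cycles_ge4 = 1.2e8   # same occupancy event with cmask=4 (assumed)
    tma_mem_bandwidth = min(clks, cycles_ge4) / clks
    tma_mem_latency = min(clks, cycles_ge1) / clks - tma_mem_bandwidth
    print(f"bandwidth-bound = {tma_mem_bandwidth:.0%}, latency-bound = {tma_mem_latency:.0%}")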
+    {
+        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops",
+        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
+        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
+        "MetricName": "tma_mem_scheduler",
+        "MetricThreshold": "(tma_mem_scheduler >0.10) & ((tma_resource_bound >0.20) & ((tma_backend_bound >0.10)))",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+        "DefaultMetricgroupName": "TopdownL2",
+        "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Backend;Default;Slots;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricName": "tma_memory_bound",
+        "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
+        "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck. Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two)",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to LFENCE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_memory_fence",
+        "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring memory operations - uops for memory load or store accesses",
+        "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_retiring * tma_info_thread_slots)",
+        "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_operations_group",
+        "MetricName": "tma_memory_operations",
+        "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operations > 0.6",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit",
+        "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots",
+        "MetricGroup": "MicroSeq;Slots;TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueMC;tma_issueMS",
+        "MetricName": "tma_microcode_sequencer",
+        "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric represents fraction of slots the CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The MS is used for CISC instructions not supported by the default decoders (like repeat move strings; or CPUID); or by microcode assists used to address some operation modes (like in Floating Point assists). These cases can often be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_ms_switches",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage",
+        "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BadSpec;BrMispredicts;BvMP;Clocks;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueBM",
+        "MetricName": "tma_mispredicts_resteers",
+        "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Branch Misprediction at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_core_clks / 2", + "MetricGroup": "DSBmiss;FetchBW;Slots_Estimated;TopdownL3;tma_L3_g= roup;tma_fetch_bandwidth_group", + "MetricName": "tma_mite", + "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck. Sa= mple with: FRONTEND_RETIRED.ANY_DSB_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", + "MetricGroup": "Clocks;TopdownL5;tma_L5_group;tma_issueMV;tma_port= s_utilized_0_group", + "MetricName": "tma_mixing_vectors", + "MetricThreshold": "tma_mixing_vectors > 0.05", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu_core@UOPS_RETIRED.MS\\,c= mask\\=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_cor= e_clks / 2", + "MetricGroup": "MicroSeq;Slots_Estimated;TopdownL3;tma_L3_group;tm= a_fetch_bandwidth_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge= \\=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricGroup": "Clocks_Estimated;FetchLat;MicroSeq;TopdownL3;tma_L= 3_group;tma_fetch_latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_iss= ueSO", + "MetricName": "tma_ms_switches", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. 
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", + "MetricGroup": "Branches;BvBO;Pipeline;Slots;TopdownL3;tma_L3_grou= p;tma_light_operations_group", + "MetricName": "tma_non_fused_branches", + "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", + "ScaleUnit": "100%", + "Unit": "cpu_atom" + }, + { + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (6 *= cpu_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_non_mem_scheduler", + "MetricThreshold": "(tma_non_mem_scheduler >0.10) & ((tma_resource= _bound >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement due to a pipeline block", - "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ / cpu_atom@= CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_l1_bound", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", + "MetricGroup": "BvBO;Pipeline;Slots;TopdownL4;tma_L4_group;tma_oth= er_light_ops_group", + "MetricName": "tma_nop_instructions", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles that the oldest l= oad of the load buffer is stalled at retirement", - "MetricExpr": "100 * (cpu_atom@LD_HEAD.L1_BOUND_AT_RET@ + cpu_atom= @MEM_BOUND_STALLS_LOAD.ALL@) / cpu_atom@CPU_CLK_UNHALTED.CORE@", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_load_bound", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", + "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (6 * cpu_a= tom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricName": "tma_nuke", + "MetricThreshold": "(tma_nuke >0.05) & ((tma_machine_clears >0.05)= & ((tma_bad_speculation >0.15)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of cycles the core is stall= ed due to store buffer full", - "MetricExpr": "100 * (cpu_atom@MEM_SCHEDULER_BLOCK.ST_BUF@ / cpu_a= tom@MEM_SCHEDULER_BLOCK.ALL@) * tma_mem_scheduler", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_store_bound_store_bound", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_other_fb", + "MetricThreshold": "(tma_other_fb >0.05) & ((tma_ifetch_bandwidth = >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Counts the number of machine clears relative = to thousands of instructions retired, due to floating point assists", - "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.FP_ASSIST@ / cpu_atom= @INST_RETIRED.ANY@", - "MetricName": "tma_info_machine_clear_bound_machine_clears_fp_assi= st_pki", + "BriefDescription": "This metric represents the remaining light uo= ps fraction the CPU has executed - remaining means not covered by other sib= ling nodes", + "MetricExpr": "max(0, tma_light_operations - (tma_fp_arith + tma_i= nt_operations + tma_memory_operations + tma_fused_instructions + tma_non_fu= sed_branches))", + "MetricGroup": "Pipeline;Slots;TopdownL3;tma_L3_group;tma_light_op= erations_group", + "MetricName": "tma_other_light_ops", + "MetricThreshold": "tma_other_light_ops > 0.3 & tma_light_operatio= ns > 0.6", + "PublicDescription": "This metric represents the remaining light u= ops fraction the CPU has executed - remaining means not covered by other si= bling nodes. 
May undercount due to FMA double counting",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to page faults",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.PAGE_FAULT@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_page_fault_pki",
+        "BriefDescription": "This metric estimates fraction of slots the CPU was stalled due to other cases of misprediction (non-retired x86 branches or other types)",
+        "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.ALL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)",
+        "MetricGroup": "BrMispredicts;BvIO;Slots;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_other_mispredicts",
+        "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of machine clears relative to thousands of instructions retired, due to self-modifying code",
-        "MetricExpr": "1e3 * cpu_atom@MACHINE_CLEARS.SMC@ / cpu_atom@INST_RETIRED.ANY@",
-        "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki",
+        "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Nukes (Machine Clears) not related to memory ordering",
+        "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY_ORDERING / MACHINE_CLEARS.COUNT), 0.0001)",
+        "MetricGroup": "BvIO;Machine_Clears;Slots;TopdownL3;tma_L3_group;tma_machine_clears_group",
+        "MetricName": "tma_other_nukes",
+        "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of total non-speculative loads with an address aliasing block",
-        "MetricExpr": "100 * cpu_atom@LD_BLOCKS.ADDRESS_ALIAS@ / cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@",
-        "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasing",
+        "BriefDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults",
+        "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots",
+        "MetricGroup": "Slots_Estimated;TopdownL5;tma_L5_group;tma_assists_group",
+        "MetricName": "tma_page_faults",
+        "MetricThreshold": "tma_page_faults > 0.05",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Page Faults. A Page Fault may apply on first application access to a memory page. 
Note operating syste= m handling of page faults accounts for the majority of its cost", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", - "MetricExpr": "100 * cpu_atom@LD_BLOCKS.DATA_UNKNOWN@ / cpu_atom@M= EM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks", + "MetricGroup": "Compute;Core_Clocks;TopdownL6;tma_L6_group;tma_alu= _op_utilization_group;tma_issue2P", + "MetricName": "tma_port_0", + "MetricThreshold": "tma_port_0 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", - "MetricExpr": "100 * cpu_atom@LD_HEAD.L1_MISS_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks", + "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_util= ization_group;tma_issue2P", + "MetricName": "tma_port_1", + "MetricThreshold": "tma_port_1 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "100 * cpu_atom@LD_HEAD.OTHER_AT_RET@ / cpu_atom@LD_= HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", + "MetricGroup": "Core_Clocks;TopdownL6;tma_L6_group;tma_alu_op_util= ization_group;tma_issue2P", + "MetricName": "tma_port_6", + "MetricThreshold": "tma_port_6 > 0.6", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", - "MetricExpr": "100 * cpu_atom@LD_HEAD.PGWALK_AT_RET@ / cpu_atom@LD= _HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk", + "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", + "MetricGroup": "Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_core_b= ound_group", + "MetricName": "tma_ports_utilization", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", - "MetricExpr": "100 * cpu_atom@LD_HEAD.DTLB_MISS_AT_RET@ / cpu_atom= @LD_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit", + "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", + "MetricExpr": "max(EXE_ACTIVITY.EXE_BOUND_0_PORTS - RESOURCE_STALL= S.SCOREBOARD, 0) / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_ports_= utilization_group", + "MetricName": "tma_ports_utilized_0", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", - "MetricExpr": "100 * cpu_atom@LD_HEAD.ST_ADDR_AT_RET@ / cpu_atom@L= D_HEAD.ANY_AT_RET@", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding= ", + "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issueL= 1;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_1", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Load", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_ipload", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", + "MetricGroup": "Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_issue2= P;tma_ports_utilization_group", + "MetricName": "tma_ports_utilized_2", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. 
Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Instructions per Store", - "MetricExpr": "cpu_atom@INST_RETIRED.ANY@ / cpu_atom@MEM_UOPS_RETI= RED.ALL_STORES@", - "MetricName": "tma_info_mem_mix_ipstore", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricConstraint": "NO_GROUP_EVENTS_NMI", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", + "MetricGroup": "BvCB;Clocks;PortsUtil;TopdownL4;tma_L4_group;tma_p= orts_utilization_group", + "MetricName": "tma_ports_utilized_3m", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads tha= t perform one or more locks", - "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.LOCK_LOADS@ / cpu_a= tom@MEM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_load_locks_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", + "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (6 * cpu_ato= m@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricName": "tma_predecode", + "MetricThreshold": "(tma_predecode >0.05) & ((tma_ifetch_bandwidth= >0.10) & ((tma_frontend_bound >0.20)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of total non-speculative loads tha= t are splits", - "MetricExpr": "100 * cpu_atom@MEM_UOPS_RETIRED.SPLIT_LOADS@ / cpu_= atom@MEM_UOPS_RETIRED.ALL_LOADS@", - "MetricName": "tma_info_mem_mix_load_splits_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (6 * cpu_atom= @CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_register", + "MetricThreshold": "(tma_register >0.10) & ((tma_resource_bound >0= .20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Ratio of mem load uops to all uops", - "MetricExpr": "1e3 * cpu_atom@MEM_UOPS_RETIRED.ALL_LOADS@ / cpu_at= om@TOPDOWN_RETIRING.ALL_P@", - "MetricName": "tma_info_mem_mix_memload_ratio", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (6 * cp= u_atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_reorder_buffer", + "MetricThreshold": "(tma_reorder_buffer >0.10) & ((tma_resource_bo= und >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": 
"Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", - "MetricExpr": "100 * cpu_atom@SERIALIZATION.C01_MS_SCB@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", - "MetricName": "tma_info_serialization _%_tpause_cycles", + "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.ALL_P@ / (6 * cpu_atom@CP= U_CLK_UNHALTED.CORE@) - tma_core_bound", + "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricName": "tma_resource_bound", + "MetricThreshold": "(tma_resource_bound >0.20) & ((tma_backend_bou= nd >0.10))", + "MetricgroupNoGroup": "TopdownL2", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Average CPU Utilization", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.REF_TSC@ / TSC", - "MetricName": "tma_info_system_cpu_utilization", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED= .RET_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;Clocks_Retired;TopdownL3;tma_L3_grou= p;tma_branch_mispredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "cpu_atom@FP_FLOPS_RETIRED.ALL@ / (duration_time * 1= e9)", - "MetricGroup": "Flops", - "MetricName": "tma_info_system_gflops", - "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", + "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "TopdownL1;tma_L1_group", + "MetricName": "tma_retiring", + "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. 
Sample with: UOPS_RETIRED= .SLOTS", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE_P@k / cpu_atom@CPU_C= LK_UNHALTED.CORE@", - "MetricGroup": "Summary", - "MetricName": "tma_info_system_kernel_utilization", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", + "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (6 * cpu= _atom@CPU_CLK_UNHALTED.CORE@)", + "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricName": "tma_serialization", + "MetricThreshold": "(tma_serialization >0.10) & ((tma_resource_bou= nd >0.20) & ((tma_backend_bound >0.10)))", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", - "MetricExpr": "cpu_atom@CPU_CLK_UNHALTED.CORE@ / cpu_atom@CPU_CLK_= UNHALTED.REF_TSC@", - "MetricGroup": "Power", - "MetricName": "tma_info_system_turbo_utilization", + "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", + "MetricGroup": "BvIO;Clocks;PortsUtil;TopdownL3;tma_L3_group;tma_c= ore_bound_group;tma_issueSO", + "MetricName": "tma_serializing_operation", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", + "ScaleUnit": "100%", "Unit": "cpu_atom" }, { - "BriefDescription": "Percentage of all uops which are FPDiv uops", - "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.FPDIV@ / cpu_atom@TOPDO= WN_RETIRING.ALL_P@", - "MetricName": "tma_info_uop_mix_fpdiv_uop_ratio", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", + "MetricGroup": "HPC;Pipeline;Slots;TopdownL4;tma_L4_group;tma_othe= r_light_ops_group", + "MetricName": "tma_shuffles_256b", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are IDiv uops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.IDIV@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@",
-        "MetricName": "tma_info_uop_mix_idiv_uop_ratio",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions",
+        "MetricConstraint": "NO_GROUP_EVENTS_NMI",
+        "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks",
+        "MetricGroup": "Clocks;TopdownL4;tma_L4_group;tma_serializing_operation_group",
+        "MetricName": "tma_slow_pause",
+        "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.PAUSE_INST",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are microcode ops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.MS@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@",
-        "MetricName": "tma_info_uop_mix_microcode_uop_ratio",
+        "BriefDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary",
+        "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Calculated;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_split_loads",
+        "MetricThreshold": "tma_split_loads > 0.3",
+        "PublicDescription": "This metric estimates fraction of cycles handling memory load split accesses - load that cross 64-byte cache line boundary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Percentage of all uops which are x87 uops",
-        "MetricExpr": "100 * cpu_atom@UOPS_RETIRED.X87@ / cpu_atom@TOPDOWN_RETIRING.ALL_P@",
-        "MetricName": "tma_info_uop_mix_x87_uop_ratio",
+        "BriefDescription": "This metric represents rate of split store accesses",
+        "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_info_thread_clks",
+        "MetricGroup": "Core_Utilization;TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bound_group",
+        "MetricName": "tma_split_stores",
+        "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents rate of split store accesses. Consider aligning your data to the 64-byte cache line granularity. Sample with: MEM_INST_RETIRED.SPLIT_STORES",
+        "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB) misses.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.ITLB_MISS@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group",
-        "MetricName": "tma_itlb_misses",
-        "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency > 0.15 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors)",
+        "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_info_thread_clks",
+        "MetricGroup": "BvMB;Clocks;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueBW;tma_l3_bound_group",
+        "MetricName": "tma_sq_full",
+        "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric measures fraction of cycles where the Super Queue (SQ) was full taking into account all request-types and both hardware SMT threads (Logical Processors). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the total number of issue slots that were not consumed by the backend because allocation is stalled due to a machine clear (nuke) of any kind including memory ordering and memory disambiguation",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group",
-        "MetricName": "tma_machine_clears",
-        "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write",
+        "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks",
+        "MetricGroup": "MemoryBound;Stalls;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
+        "MetricName": "tma_store_bound",
+        "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to RFO store memory accesses; RFO store issue a read-for-ownership request before the write. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should RFO stores be a bottleneck. Sample with: MEM_INST_RETIRED.ALL_STORES",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to memory reservation stalls in which a scheduler is not able to accept uops",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.MEM_SCHEDULER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_mem_scheduler",
-        "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores",
+        "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;TopdownL4;tma_L4_group;tma_l1_bound_group",
+        "MetricName": "tma_store_fwd_blk",
+        "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates fraction of cycles when the memory subsystem had loads blocked since they could not forward data from earlier (in program order) overlapping stores. To streamline memory operations in the pipeline; a load can avoid waiting for memory if a prior in-flight store is writing the data that the load wants to read (store forwarding process). However; in some cases the load may be blocked for a significant time pending the store forward. For example; when the prior store is writing a smaller region than the load is reading",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to IEC or FPC RAT stalls, which can be due to FIQ or IEC reservation stalls in which the integer, floating point or SIMD scheduler is not able to accept uops",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_non_mem_scheduler",
-        "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses",
+        "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks",
+        "MetricGroup": "BvML;Clocks_Estimated;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_issueSL;tma_overlap;tma_store_bound_group",
+        "MetricName": "tma_store_latency",
+        "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU spent handling L1D store misses. Store accesses usually less impact out-of-order core performance; however; holding resources for longer time can lead into undesired implications (e.g. contention on L1D fill-buffer entries - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l3_hit_latency, tma_lock_latency",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to a machine clear that requires the use of microcode (slow nuke)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BAD_SPECULATION.NUKE@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group",
-        "MetricName": "tma_nuke",
-        "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 & tma_bad_speculation > 0.15)",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations",
+        "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_8) / (4 * tma_info_core_core_clks)",
+        "MetricGroup": "Core_Execution;TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
+        "MetricName": "tma_store_op_utilization",
+        "MetricThreshold": "tma_store_op_utilization > 0.6",
+        "PublicDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution port for Store operations. Sample with: UOPS_DISPATCHED.PORT_7_8",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to other common frontend stalls not categorized.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.OTHER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
-        "MetricName": "tma_other_fb",
-        "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the TLB was missed by store accesses, hitting in the second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_hit",
+        "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not delivered by the frontend due to wrong predecodes.",
-        "MetricExpr": "cpu_atom@TOPDOWN_FE_BOUND.PREDECODE@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group",
-        "MetricName": "tma_predecode",
-        "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth > 0.1 & tma_frontend_bound > 0.2)",
+        "BriefDescription": "This metric estimates the fraction of cycles where the STLB was missed by store accesses, performing a hardware page walk",
+        "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_clks",
+        "MetricGroup": "Clocks_Calculated;MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_group",
+        "MetricName": "tma_store_stlb_miss",
+        "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the physical register file unable to accept an entry (marble stalls)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REGISTER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_register",
-        "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 1 GB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_1g",
+        "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to the reorder buffer being full (ROB stalls)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.REORDER_BUFFER@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_reorder_buffer",
-        "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of cycles the core is stalled due to a resource limitation",
-        "MetricExpr": "tma_backend_bound - tma_core_bound",
-        "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group",
-        "MetricName": "tma_resource_bound",
-        "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound > 0.1",
-        "MetricgroupNoGroup": "TopdownL2",
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "Clocks_Estimated;MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that result in retirement slots",
-        "MetricExpr": "cpu_atom@TOPDOWN_RETIRING.ALL_P@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL1;tma_L1_group",
-        "MetricName": "tma_retiring",
-        "MetricThreshold": "tma_retiring > 0.75",
-        "MetricgroupNoGroup": "TopdownL1",
+        "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
+        "MetricGroup": "Clocks_Estimated;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
+        "MetricName": "tma_streaming_stores",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
+        "MetricGroup": "BigFootprint;BvBC;Clocks;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
+        "MetricName": "tma_unknown_branches",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
     {
-        "BriefDescription": "Counts the number of issue slots that were not consumed by the backend due to scoreboards from the instruction queue (IQ), jump execution unit (JEU), or microcode sequencer (MS)",
-        "MetricExpr": "cpu_atom@TOPDOWN_BE_BOUND.SERIALIZATION@ / (6 * cpu_atom@CPU_CLK_UNHALTED.CORE@)",
-        "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group",
-        "MetricName": "tma_serialization",
-        "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound > 0.2 & tma_backend_bound > 0.1)",
+        "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
+        "MetricGroup": "Compute;TopdownL4;Uops;tma_L4_group;tma_fp_arith_group",
+        "MetricName": "tma_x87_use",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%",
         "Unit": "cpu_atom"
     },
@@ -708,8 +2712,8 @@
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
-        "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_0@ + cpu_core@UOPS_DISPATCHED.PORT_1@ + cpu_core@UOPS_DISPATCHED.PORT_5_11@ + cpu_core@UOPS_DISPATCHED.PORT_6@) / (5 * tma_info_core_core_clks)",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
+        "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
         "MetricThreshold": "tma_alu_op_utilization > 0.4",
@@ -718,17 +2722,17 @@
     },
     {
         "BriefDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists",
-        "MetricExpr": "78 * cpu_core@ASSISTS.ANY@ / tma_info_thread_slots",
+        "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.",
-        "MetricExpr": "63 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thread_slots",
+        "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists",
+        "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots",
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_avx_assists",
         "MetricThreshold": "tma_avx_assists > 0.1",
@@ -737,7 +2741,7 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
-        "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
+        "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
@@ -753,46 +2757,151 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_mem_bandwidth, tma_sq_full",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
-        "MetricExpr": "cpu_core@topdown\\-br\\-mispredict@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
+        "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
         "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers",
-        "MetricExpr": "cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks + tma_unknown_branches",
+        "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings).",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C01@ / tma_info_thread_clks",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.1 power-performance optimized state (Faster wakeup time; Smaller power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks",
         "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_c01_wait",
-        "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
-        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings).",
-        "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C02@ / tma_info_thread_clks",
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due staying in C0.2 power-performance optimized state (Slower wakeup time; Larger power savings)",
+        "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks",
         "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_operation_group",
         "MetricName": "tma_c02_wait",
-        "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operation > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
@@ -801,28 +2910,100 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRED.MS_FLOWS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears",
-        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * cpu_core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks",
+        "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that hit in the L2 cache",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.L1I_MISS * cpu_core@FRONTEND_RETIRED.L1I_MISS@R / tma_info_thread_clks - tma_code_l2_miss)",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_hit",
+        "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates fraction of cycles the CPU was stalled due to instruction cache misses that miss in the L2 cache",
+        "MetricExpr": "FRONTEND_RETIRED.L2_MISS * cpu_core@FRONTEND_RETIRED.L2_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma_icache_misses_group",
+        "MetricName": "tma_code_l2_miss",
+        "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, FRONTEND_RETIRED.ITLB_MISS * cpu_core@FRONTEND_RETIRED.ITLB_MISS@R / tma_info_thread_clks - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "FRONTEND_RETIRED.STLB_MISS * cpu_core@FRONTEND_RETIRED.STLB_MISS@R / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to retired misprediction by non-taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_NTAKEN_COST * cpu_core@BR_MISP_RETIRED.COND_NTAKEN_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_nt_mispredicts",
+        "MetricThreshold": "tma_cond_nt_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to misprediction by taken conditional branches",
+        "MetricExpr": "BR_MISP_RETIRED.COND_TAKEN_COST * cpu_core@BR_MISP_RETIRED.COND_TAKEN_COST@R / tma_info_thread_clks",
+        "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mispredicts_group",
+        "MetricName": "tma_cond_tk_mispredicts",
+        "MetricThreshold": "tma_cond_tk_mispredicts > 0.05 & tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
-        "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, 24 * tma_info_system_core_frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 25 * tma_info_system_core_frequency) * (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (28 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
@@ -833,107 +3014,107 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
-        "MetricExpr": "(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, 24 * tma_info_system_core_frequency) + cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@ * min(cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, 24 * tma_info_system_core_frequency) * (1 - cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ / (cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM@ + cpu_core@OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD@))) * (1 + cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_info_thread_clks",
+        "MetricExpr": "((min(MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) + (min(MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R, MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD@R else MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (27 * tma_info_system_core_frequency) - 3 * tma_info_system_core_frequency) * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu_core@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu_core@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu_core@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where the Divider unit was active",
-        "MetricExpr": "cpu_core@ARITH.DIV_ACTIVE@ / tma_info_thread_clks",
+        "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_UOPS",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIV_ACTIVE",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
-        "MetricExpr": "cpu_core@MEMORY_ACTIVITY.STALLS_L3_MISS@ / tma_info_thread_clks",
+        "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline",
-        "MetricExpr": "(cpu_core@IDQ.DSB_CYCLES_ANY@ - cpu_core@IDQ.DSB_CYCLES_OK@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(IDQ.DSB_CYCLES_ANY - IDQ.DSB_CYCLES_OK) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines",
-        "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / tma_info_thread_clks",
+        "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@ * min(cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, 7) / tma_info_thread_clks + tma_load_stlb_miss",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_LOADS * cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R, MEM_INST_RETIRED.STLB_HIT_LOADS * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_LOADS@R else MEM_INST_RETIRED.STLB_HIT_LOADS * 7) / tma_info_thread_clks + tma_load_stlb_miss",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@ * min(cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, 7) / tma_info_thread_clks + tma_store_stlb_miss",
+        "MetricExpr": "(min(MEM_INST_RETIRED.STLB_HIT_STORES * cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R, MEM_INST_RETIRED.STLB_HIT_STORES * 7) if 0 < cpu_core@MEM_INST_RETIRED.STLB_HIT_STORES@R else MEM_INST_RETIRED.STLB_HIT_STORES * 7) / tma_info_thread_clks + tma_store_stlb_miss",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
-        "MetricExpr": "28 * tma_info_system_core_frequency * cpu_core@OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricExpr": "28 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM.
Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", - "MetricExpr": "cpu_core@L1D_PEND_MISS.FB_FULL@ / tma_info_thread_c= lks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -944,28 +3125,28 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. 
In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = CPU was stalled due to Frontend latency issues", - "MetricExpr": "cpu_core@topdown\\-fetch\\-lat@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / t= ma_info_thread_slots", + "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.U= OP_DROPPING / tma_info_thread_slots", "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend= _bound_group", "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequenc= er)", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. 
This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -975,137 +3156,164 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Floating Point (FP) Assists", - "MetricExpr": "30 * cpu_core@ASSISTS.FP@ / tma_info_thread_slots", + "MetricExpr": "30 * ASSISTS.FP / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) scalar uops fraction the CPU has retired", - "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ / (tma_retir= ing * tma_info_thread_slots)", + "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_2= 56b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma= _port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic floating-= point (FP) vector uops fraction the CPU has retired aggregated across all v= ector widths", - "MetricExpr": "cpu_core@FP_ARITH_INST_RETIRED.VECTOR@ / (tma_retir= ing * tma_info_thread_slots)", + "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, = tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized= _2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 128-bit wide vectors", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric approximates arithmetic FP vector= uops fraction the CPU has retired for 256-bit wide vectors", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE@= + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) / (tma_retiring * tm= a_info_thread_slots)", + "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vecto= r_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_= 2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_int_vector_128b, tma_int_vector_256b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This category represents fraction of slots wh= ere the processor's Frontend undersupplies its Backend", - "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\= \-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retirin= g@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tm= a_info_thread_slots", + "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topd= own\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UO= P_DROPPING / tma_info_thread_slots", "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. 
Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", - "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.MACRO_= FUSED@ / (tma_retiring * tma_info_thread_slots)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", + "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "cpu_core@topdown\\-heavy\\-ops@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. 
([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .). Sample with: UOPS= _RETIRED.HEAVY", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+]). Sample with: UOPS_RET= IRED.HEAVY", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / tma_info_thread_clks= ", + "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect CALL instructions= ", + "MetricExpr": "BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_CALL_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_call_mispredicts", + "MetricThreshold": "tma_ind_call_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by indirect JMP instructions", + "MetricExpr": "max((BR_MISP_RETIRED.INDIRECT_COST * cpu_core@BR_MI= SP_RETIRED.INDIRECT_COST@R - BR_MISP_RETIRED.INDIRECT_CALL_COST * cpu_core@= BR_MISP_RETIRED.INDIRECT_CALL_COST@R) / tma_info_thread_clks, 0)", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ind_jump_mispredicts", + "MetricThreshold": "tma_ind_jump_mispredicts > 0.05 & tma_branch_m= ispredicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / cpu_core@BR_MISP_RETIRED.ALL_BRANCHES@ / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA 
s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers", + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_NTAKEN@", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.COND_TAKEN@", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.INDIRECT@", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3", + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000", "Unit": "cpu_core" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.RET@", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", "MetricThreshold": "tma_info_bad_spec_ipmisp_ret < 500", @@ -1113,15 +3321,15 @@ }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_MISP_RETIR= ED.ALL_BRANCHES@", + "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;BadSpec;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmispredict", "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200", "Unit": "cpu_core" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", - "MetricExpr": 
"cpu_core@INT_MISC.CLEARS_COUNT@ / (cpu_core@BR_MISP= _RETIRED.ALL_BRANCHES@ + cpu_core@MACHINE_CLEARS.COUNT@)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", + "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio", "Unit": "cpu_core" @@ -1136,8 +3344,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite)))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", @@ -1145,7 +3353,7 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -1154,142 +3362,36 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ + 2 = * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRED.NOP@) / tma_i= nfo_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_h= it_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_dtlb_load + tma_fb_full + tma_l1_hit_= latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: ", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - c= pu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D= 1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_cl= ears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_bran= ch_mispredicts) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unk= nown_branches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_miss= es + tma_itlb_misses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_b= ig_code", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - cpu_core@INST_RETIRED.REP_ITERATION@ / = cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_swi= tches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_restee= rs * tma_other_mispredicts / tma_branch_mispredicts) / (tma_clears_resteers= + tma_mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers= + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_m= s_switches)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_b= ranch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other= _nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + cp= u_core@RS.EMPTY\\,umask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (= tma_dtlb_load / (tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_loc= k_latency + tma_split_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma= _store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound= + tma_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharin= g + tma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls.", - "Unit": "cpu_core" - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (cpu_core@BR_INST_RETIRED.ALL= _BRANCHES@ + 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@INST_RETIRE= D.NOP@) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_i= nstructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_seque= ncer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are CALL or RET", - "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_CALL@ + cpu_core@BR_= INST_RETIRED.NEAR_RETURN@) / cpu_core@BR_INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "(BR_INST_RETIRED.NEAR_CALL + BR_INST_RETIRED.NEAR_R= ETURN) / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;Branches", "MetricName": "tma_info_branches_callret", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are non-taken condi= tionals", - "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_NTAKEN@ / cpu_core@BR= _INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "BR_INST_RETIRED.COND_NTAKEN / BR_INST_RETIRED.ALL_B= RANCHES", "MetricGroup": "Bad;Branches;CodeGen;PGO", "MetricName": "tma_info_branches_cond_nt", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are taken condition= als", - "MetricExpr": "cpu_core@BR_INST_RETIRED.COND_TAKEN@ / cpu_core@BR_= INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "BR_INST_RETIRED.COND_TAKEN / BR_INST_RETIRED.ALL_BR= ANCHES", "MetricGroup": "Bad;Branches;CodeGen;PGO", "MetricName": "tma_info_branches_cond_tk", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of branches that are unconditional (= direct or indirect) jumps", - "MetricExpr": "(cpu_core@BR_INST_RETIRED.NEAR_TAKEN@ - cpu_core@BR= _INST_RETIRED.COND_TAKEN@ - 2 * cpu_core@BR_INST_RETIRED.NEAR_CALL@) / cpu_= core@BR_INST_RETIRED.ALL_BRANCHES@", + "MetricExpr": "(BR_INST_RETIRED.NEAR_TAKEN - BR_INST_RETIRED.COND_= TAKEN - 2 * BR_INST_RETIRED.NEAR_CALL) / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Bad;Branches", "MetricName": "tma_info_branches_jump", "Unit": "cpu_core" @@ -1303,50 +3405,50 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - 
"MetricExpr": "(cpu_core@CPU_CLK_UNHALTED.DISTRIBUTED@ if #SMT_on = else tma_info_thread_clks)", + "MetricExpr": "(CPU_CLK_UNHALTED.DISTRIBUTED if #SMT_on else tma_i= nfo_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks", "Unit": "cpu_core" }, { "BriefDescription": "Instructions Per Cycle across hyper-threads (= per physical core)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_core_core_clk= s", + "MetricExpr": "INST_RETIRED.ANY / tma_info_core_core_clks", "MetricGroup": "Ret;SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_core_coreipc", "Unit": "cpu_core" }, { "BriefDescription": "uops Executed per Cycle", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / tma_info_thread_cl= ks", + "MetricExpr": "UOPS_EXECUTED.THREAD / tma_info_thread_clks", "MetricGroup": "Power", "MetricName": "tma_info_core_epc", "Unit": "cpu_core" }, { "BriefDescription": "Floating Point Operations Per Cycle", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / tma_info_core_core_clks", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / tma_info_core_core_clks", "MetricGroup": "Flops;Ret", "MetricName": "tma_info_core_flopc", "Unit": "cpu_core" }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(cpu_core@FP_ARITH_DISPATCHED.PORT_0@ + cpu_core@FP= _ARITH_DISPATCHED.PORT_1@ + cpu_core@FP_ARITH_DISPATCHED.PORT_5@) / (2 * tm= a_info_core_core_clks)", + "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n).", + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)", "Unit": "cpu_core" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_EXEC= UTED.THREAD\\,cmask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu_core@UOPS_EXECUTED.THREA= D\\,cmask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Uops delivered by the DSB (aka De= coded ICache; or Uop Cache)", - "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricExpr": "IDQ.DSB_UOPS / UOPS_ISSUED.ANY", "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_frontend_dsb_coverage", "MetricThreshold": "tma_info_frontend_dsb_coverage < 0.7 & tma_inf= o_thread_ipc / 6 > 0.35", @@ -1354,29 +3456,37 @@ "Unit": "cpu_core" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "cpu_core@DSB2MITE_SWITCHES.PENALTY_CYCLES@ / cpu_co= re@DSB2MITE_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu_core@DSB2MIT= E_SWITCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired DSB misses", + "MetricExpr": "FRONTEND_RETIRED.ANY_DSB_MISS * cpu_core@FRONTEND_R= ETIRED.ANY_DSB_MISS@R / tma_info_thread_clks", + "MetricGroup": "DSBmiss;Fed;FetchLat", + "MetricName": "tma_info_frontend_dsb_switches_ret", + "MetricThreshold": "tma_info_frontend_dsb_switches_ret > 0.05", + "Unit": "cpu_core" + }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "cpu_core@UOPS_ISSUED.ANY@ / cpu_core@UOPS_ISSUED.AN= Y\\,cmask\\=3D1@", + "MetricExpr": "UOPS_ISSUED.ANY / cpu_core@UOPS_ISSUED.ANY\\,cmask\= \=3D0x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "cpu_core@ICACHE_DATA.STALLS@ / cpu_core@ICACHE_DATA= .STALLS\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_DATA.STALLS / ICACHE_DATA.STALL_PERIODS", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per non-speculative DSB miss (lo= wer number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FRONTEND_RETI= RED.ANY_DSB_MISS@", + "MetricExpr": "INST_RETIRED.ANY / FRONTEND_RETIRED.ANY_DSB_MISS", "MetricGroup": "DSBmiss;Fed", "MetricName": "tma_info_frontend_ipdsb_miss_ret", "MetricThreshold": "tma_info_frontend_ipdsb_miss_ret < 50", @@ -1384,50 +3494,72 @@ }, { "BriefDescription": "Instructions per speculative Unknown Branch M= isprediction (BAClear) (lower number means higher occurrence rate)", - "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@BACLEARS.= 
ANY@", + "MetricExpr": "tma_info_inst_mix_instructions / BACLEARS.ANY", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_ipunknown_branch", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache true code cacheline misses per kilo = instruction", - "MetricExpr": "1e3 * cpu_core@FRONTEND_RETIRED.L2_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * FRONTEND_RETIRED.L2_MISS / INST_RETIRED.ANY", "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache speculative code cacheline misses pe= r kilo instruction", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.CODE_RD_MISS@ / cpu_core@IN= ST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.CODE_RD_MISS / INST_RETIRED.ANY", "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code_all", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Uops delivered by the LSD (Loop S= tream Detector; aka Loop Cache)", - "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@UOPS_ISSUED.ANY@", + "MetricExpr": "LSD.UOPS / UOPS_ISSUED.ANY", "MetricGroup": "Fed;LSD", "MetricName": "tma_info_frontend_lsd_coverage", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired operations that invoke th= e Microcode Sequencer", + "MetricExpr": "FRONTEND_RETIRED.MS_FLOWS * cpu_core@FRONTEND_RETIR= ED.MS_FLOWS@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat;MicroSeq", + "MetricName": "tma_info_frontend_ms_latency_ret", + "MetricThreshold": "tma_info_frontend_ms_latency_ret > 0.05", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc", + "Unit": "cpu_core" + }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / cpu_core= @INT_MISC.UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu_core@INT_MISC.= UNKNOWN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node.", + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. 
See Unknown_Branches node", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to retired branches who got branch a= ddress clears", + "MetricExpr": "FRONTEND_RETIRED.UNKNOWN_BRANCH * cpu_core@FRONTEND= _RETIRED.UNKNOWN_BRANCH@R / tma_info_thread_clks", + "MetricGroup": "Fed;FetchLat", + "MetricName": "tma_info_frontend_unknown_branches_ret", "Unit": "cpu_core" }, { - "BriefDescription": "Branch instructions per taken branch.", - "MetricExpr": "cpu_core@BR_INST_RETIRED.ALL_BRANCHES@ / cpu_core@B= R_INST_RETIRED.NEAR_TAKEN@", + "BriefDescription": "Branch instructions per taken branch", + "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch", "Unit": "cpu_core" }, { "BriefDescription": "Total number of retired Instructions", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "INST_RETIRED.ANY", "MetricGroup": "Summary;TmaL1;tma_L1_group", "MetricName": "tma_info_inst_mix_instructions", "PublicDescription": "Total number of retired Instructions. Sample= with: INST_RETIRED.PREC_DIST", @@ -1435,52 +3567,52 @@ }, { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + cpu_core@FP_ARITH_INST_RETIRED.VECTOR@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = FP_ARITH_INST_RETIRED.VECTOR)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW.", + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.128B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_= SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.128B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE)", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.256B_PACKED_DOUBLE@ + cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_= SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.256B_PACK= ED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST= _RETIRED.SCALAR_DOUBLE@", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_DOU= BLE", "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@FP_ARITH_INST= _RETIRED.SCALAR_SINGLE@", + "MetricExpr": "INST_RETIRED.ANY / FP_ARITH_INST_RETIRED.SCALAR_SIN= GLE", "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting.", + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.ALL_BRANCHES@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.ALL_BRANCHES", "MetricGroup": "Branches;Fed;InsType", "MetricName": "tma_info_inst_mix_ipbranch", "MetricThreshold": "tma_info_inst_mix_ipbranch < 8", @@ -1488,7 +3620,7 @@ }, { "BriefDescription": "Instructions per (near) call (lower number me= ans higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_CALL@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_CALL", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_ipcall", "MetricThreshold": "tma_info_inst_mix_ipcall < 200", @@ -1496,7 +3628,7 @@ }, { "BriefDescription": "Instructions per Floating Point (FP) Operatio= n (lower number means higher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / (cpu_core@FP_ARITH_INS= T_RETIRED.SCALAR@ + 2 * cpu_core@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ = + 4 * cpu_core@FP_ARITH_INST_RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_= RETIRED.256B_PACKED_SINGLE@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = 2 * FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_= FLOPS + 8 * FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_ipflop", "MetricThreshold": "tma_info_inst_mix_ipflop < 10", @@ -1504,7 +3636,7 @@ }, { "BriefDescription": "Instructions per Load (lower number means hig= her occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETI= RED.ALL_LOADS@", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_LOADS", "MetricGroup": "InsType", "MetricName": "tma_info_inst_mix_ipload", "MetricThreshold": "tma_info_inst_mix_ipload < 3", @@ -1512,14 +3644,14 @@ }, { "BriefDescription": "Instructions per PAUSE (lower number means hi= gher occurrence rate)", - "MetricExpr": "tma_info_inst_mix_instructions / cpu_core@CPU_CLK_U= NHALTED.PAUSE_INST@", + "MetricExpr": "tma_info_inst_mix_instructions / CPU_CLK_UNHALTED.P= AUSE_INST", "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_ippause", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per Store (lower number means hi= gher occurrence rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@MEM_INST_RETI= RED.ALL_STORES@", + "MetricExpr": "INST_RETIRED.ANY / MEM_INST_RETIRED.ALL_STORES", "MetricGroup": "InsType", "MetricName": "tma_info_inst_mix_ipstore", "MetricThreshold": "tma_info_inst_mix_ipstore < 8", @@ -1527,7 +3659,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@SW_PREFETCH_A= CCESS.T0\\,umask\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100", @@ -1535,10 +3667,10 @@ }, { "BriefDescription": "Instructions per taken branch", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.NEAR_TAKEN@", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": 
"tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 13", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp", "Unit": "cpu_core" }, @@ -1572,235 +3704,259 @@ }, { "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@= INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_fb_hpki", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L1D.REPLACEMENT@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki", "Unit": "cpu_core" }, { "BriefDescription": "L1 cache true misses per kilo instruction for= all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.ALL_DEMAND_DATA_RD@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.ALL_DEMAND_DATA_RD / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l1mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@L2_LINES_IN.ALL@ / 1e9 / duration_tim= e", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", - "MetricExpr": "1e3 * (cpu_core@L2_RQSTS.REFERENCES@ - cpu_core@L2_= RQSTS.MISS@) / cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache hits per kilo instruction for all de= mand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_HIT@ / cpu_c= ore@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_HIT / INST_RETIRED.AN= Y", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2hpki_load", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L2_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L2_MISS / INST_RETIRED.ANY", "MetricGroup": "Backend;CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all request types (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.MISS@ / cpu_core@INST_RETIR= 
ED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.MISS / INST_RETIRED.ANY", "MetricGroup": "CacheHits;Mem;Offcore", "MetricName": "tma_info_memory_l2mpki_all", "Unit": "cpu_core" }, { "BriefDescription": "L2 cache ([RKL+] true) misses per kilo instru= ction for all demand loads (including speculative)", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.DEMAND_DATA_RD_MISS@ / cpu_= core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.DEMAND_DATA_RD_MISS / INST_RETIRED.A= NY", "MetricGroup": "CacheHits;Mem", "MetricName": "tma_info_memory_l2mpki_load", "Unit": "cpu_core" }, { "BriefDescription": "Offcore requests (L2 cache miss) per kilo ins= truction for demand RFOs", - "MetricExpr": "1e3 * cpu_core@L2_RQSTS.RFO_MISS@ / cpu_core@INST_R= ETIRED.ANY@", + "MetricExpr": "1e3 * L2_RQSTS.RFO_MISS / INST_RETIRED.ANY", "MetricGroup": "CacheMisses;Offcore", "MetricName": "tma_info_memory_l2mpki_rfo", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@OFFCORE_REQUESTS.ALL_REQUESTS@ / 1e9 = / duration_time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw", "Unit": "cpu_core" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * cpu_core@LONGEST_LAT_CACHE.MISS@ / 1e9 / durat= ion_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw", "Unit": "cpu_core" }, { "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_RETIRED.L3_MISS@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_l3mpki", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss data reads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD@ / cp= u_core@OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DATA_RD / OFFCORE_REQU= ESTS_OUTSTANDING.CYCLES_WITH_DATA_RD", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_data_l2_mlp", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS.DEMAND_DATA_RD@", - "MetricGroup": "Memory_Lat;Offcore", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_R= D@ / cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu_c= ore@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp", "Unit": "cpu_core" }, { "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu_core@OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAN= D_DATA_RD@ / cpu_core@OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD@", + "MetricExpr": 
"OFFCORE_REQUESTS_OUTSTANDING.L3_MISS_DEMAND_DATA_RD= / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", "MetricGroup": "Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l3_miss_latency", "Unit": "cpu_core" }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@MEM_LOAD= _COMPLETED.L1_MISS_ANY@", + "MetricExpr": "L1D_PEND_MISS.PENDING / MEM_LOAD_COMPLETED.L1_MISS_= ANY", "MetricGroup": "Mem;MemoryBound;MemoryLat", "MetricName": "tma_info_memory_load_miss_real_latency", "Unit": "cpu_core" }, { "BriefDescription": "\"Bus lock\" per kilo instruction", - "MetricExpr": "1e3 * cpu_core@SQ_MISC.BUS_LOCK@ / cpu_core@INST_RE= TIRED.ANY@", + "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_bus_lock_pki", "Unit": "cpu_core" }, { "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "1e3 * cpu_core@MEM_LOAD_MISC_RETIRED.UC@ / cpu_core= @INST_RETIRED.ANY@", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_uc_load_pki", "Unit": "cpu_core" }, { "BriefDescription": "Memory-Level-Parallelism (average number of L= 1 miss demand load when there is at least one such miss", - "MetricExpr": "cpu_core@L1D_PEND_MISS.PENDING@ / cpu_core@L1D_PEND= _MISS.PENDING_CYCLES@", + "MetricExpr": "L1D_PEND_MISS.PENDING / L1D_PEND_MISS.PENDING_CYCLE= S", "MetricGroup": "Mem;MemoryBW;MemoryBound", "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)", "Unit": "cpu_core" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15= ", + "Unit": "cpu_core" + }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", - "MetricExpr": "1e3 * cpu_core@ITLB_MISSES.WALK_COMPLETED@ / cpu_co= re@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricGroup": "Fed;MemoryTLB", "MetricName": "tma_info_memory_tlb_code_stlb_mpki", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand loads", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_LOADS * cpu_core@MEM_INS= T_RETIRED.STLB_MISS_LOADS@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_load_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_load_stlb_miss_ret > 0.05", + "Unit": "cpu_core" + }, { "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", - "MetricExpr": "1e3 * cpu_core@DTLB_LOAD_MISSES.WALK_COMPLETED@ / c= pu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRE= D.ANY", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_load_stlb_mpki", "Unit": "cpu_core" }, { "BriefDescription": "Utilization of the core's Page 
Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "(cpu_core@ITLB_MISSES.WALK_PENDING@ + cpu_core@DTLB= _LOAD_MISSES.WALK_PENDING@ + cpu_core@DTLB_STORE_MISSES.WALK_PENDING@) / (4= * tma_info_core_core_clks)", + "MetricExpr": "(ITLB_MISSES.WALK_PENDING + DTLB_LOAD_MISSES.WALK_P= ENDING + DTLB_STORE_MISSES.WALK_PENDING) / (4 * tma_info_core_core_clks)", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_page_walks_utilization", "MetricThreshold": "tma_info_memory_tlb_page_walks_utilization > 0= .5", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU retirement was stalled likely due to STLB misses by demand stores", + "MetricExpr": "MEM_INST_RETIRED.STLB_MISS_STORES * cpu_core@MEM_IN= ST_RETIRED.STLB_MISS_STORES@R / tma_info_thread_clks", + "MetricGroup": "Mem;MemoryTLB", + "MetricName": "tma_info_memory_tlb_store_stlb_miss_ret", + "MetricThreshold": "tma_info_memory_tlb_store_stlb_miss_ret > 0.05= ", + "Unit": "cpu_core" + }, { "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", - "MetricExpr": "1e3 * cpu_core@DTLB_STORE_MISSES.WALK_COMPLETED@ / = cpu_core@INST_RETIRED.ANY@", + "MetricExpr": "1e3 * DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIR= ED.ANY", "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_mpki", "Unit": "cpu_core" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / (cpu_core@UOPS_EXE= CUTED.CORE_CYCLES_GE_1@ / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\= ,cmask\\=3D1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu_core@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from DSB per c= ycle", - "MetricExpr": "cpu_core@IDQ.DSB_UOPS@ / cpu_core@IDQ.DSB_CYCLES_AN= Y@", + "MetricExpr": "IDQ.DSB_UOPS / IDQ.DSB_CYCLES_ANY", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_dsb", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from LSD per c= ycle", - "MetricExpr": "cpu_core@LSD.UOPS@ / cpu_core@LSD.CYCLES_ACTIVE@", + "MetricExpr": "LSD.UOPS / LSD.CYCLES_ACTIVE", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_lsd", "Unit": "cpu_core" }, { "BriefDescription": "Average number of uops fetched from MITE per = cycle", - "MetricExpr": "cpu_core@IDQ.MITE_UOPS@ / cpu_core@IDQ.MITE_CYCLES_= ANY@", + "MetricExpr": "IDQ.MITE_UOPS / IDQ.MITE_CYCLES_ANY", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_pipeline_fetch_mite", "Unit": "cpu_core" }, { "BriefDescription": "Instructions per a microcode Assist invocatio= n", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@ASSISTS.ANY@", + "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. 
See Assists tree node for details (lower number means higher occurrence= rate)", "Unit": "cpu_core" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire", "Unit": "cpu_core" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "cpu_core@INST_RETIRED.REP_ITERATION@ / cpu_core@UOP= S_RETIRED.SLOTS\\,cmask\\=3D1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu_core@UOPS_RETIRED.= SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1", @@ -1808,7 +3964,7 @@ }, { "BriefDescription": "Fraction of cycles the processor is waiting y= et unhalted; covering legacy PAUSE instruction, as well as C0.1 / C0.2 powe= r-performance optimized states", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.C0_WAIT@ / tma_info_threa= d_clks", + "MetricExpr": "CPU_CLK_UNHALTED.C0_WAIT / tma_info_thread_clks", "MetricGroup": "C0Wait", "MetricName": "tma_info_system_c0_wait", "MetricThreshold": "tma_info_system_c0_wait > 0.05", @@ -1816,7 +3972,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency", "Unit": "cpu_core" @@ -1830,22 +3986,22 @@ }, { "BriefDescription": "Average number of utilized CPUs", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.REF_TSC@ / TSC", + "MetricExpr": "CPU_CLK_UNHALTED.REF_TSC / TSC", "MetricGroup": "Summary", "MetricName": "tma_info_system_cpus_utilized", "Unit": "cpu_core" }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_HAC_ARB_TRK_REQUESTS.ALL + UNC_HAC_ARB_CO= H_TRK_REQUESTS.ALL) / 1e9 / duration_time", + "MetricExpr": "64 * (UNC_HAC_ARB_TRK_REQUESTS.ALL + UNC_HAC_ARB_CO= H_TRK_REQUESTS.ALL) / 1e9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full", + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full", "Unit": "cpu_core" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(cpu_core@FP_ARITH_INST_RETIRED.SCALAR@ + 2 * cpu_c= ore@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE@ + 4 * cpu_core@FP_ARITH_INST_= RETIRED.4_FLOPS@ + 8 * cpu_core@FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE@) = / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width", @@ -1853,22 +4009,23 @@ }, { "BriefDescription": "Instructions per Far Branch ( Far Branches ap= ply upon transition from application to operating system, handling interrup= ts, exceptions) [lower number means higher occurrence rate]", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / cpu_core@BR_INST_RETIR= ED.FAR_BRANCH@u", + "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6", + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000", "Unit": "cpu_core" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@INS= T_RETIRED.ANY_P@k", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD_P@k / cpu_core@CPU= _CLK_UNHALTED.THREAD@", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / CPU_CLK_UNHALTED.THRE= AD", "MetricGroup": "OS", "MetricName": "tma_info_system_kernel_utilization", "MetricThreshold": "tma_info_system_kernel_utilization > 0.05", @@ -1876,15 +4033,30 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D1@", + "MetricExpr": "UNC_ARB_DAT_OCCUPANCY.RD / UNC_ARB_DAT_OCCUPANCY.RD= @cmask\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches", "Unit": "cpu_core" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9", + "Unit": "cpu_core" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power", + "Unit": "cpu_core" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", - "MetricExpr": "(1 - cpu_core@CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE@ /= cpu_core@CPU_CLK_UNHALTED.REF_DISTRIBUTED@ if #SMT_on else 0)", + "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / CPU_CLK_U= NHALTED.REF_DISTRIBUTED if #SMT_on else 0)", "MetricGroup": "SMT", "MetricName": "tma_info_system_smt_2t_utilization", "Unit": "cpu_core" @@ -1896,16 +4068,24 @@ "MetricName": "tma_info_system_socket_clks", "Unit": "cpu_core" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1", + "Unit": "cpu_core" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", - "MetricExpr": "tma_info_thread_clks / cpu_core@CPU_CLK_UNHALTED.RE= F_TSC@", + "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", "MetricGroup": "Power", "MetricName": "tma_info_system_turbo_utilization", "Unit": "cpu_core" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.THREAD@", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks", "Unit": "cpu_core" @@ -1915,40 +4095,41 @@ "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr", "Unit": "cpu_core" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", - "MetricExpr": "cpu_core@UOPS_EXECUTED.THREAD@ / cpu_core@UOPS_ISSU= ED.ANY@", + "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage.", + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage", "Unit": "cpu_core" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", - "MetricExpr": "cpu_core@INST_RETIRED.ANY@ / tma_info_thread_clks", + "MetricExpr": "INST_RETIRED.ANY / tma_info_thread_clks", "MetricGroup": "Ret;Summary", "MetricName": "tma_info_thread_ipc", "Unit": "cpu_core" }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "cpu_core@TOPDOWN.SLOTS@", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots", "Unit": "cpu_core" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (cpu_core@TOPDOWN.SLOTS@ /= 2) if #SMT_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization", "Unit": "cpu_core" }, { "BriefDescription": "Uops Per Instruction", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@INS= T_RETIRED.ANY@", + "MetricExpr": "tma_retiring * tma_info_thread_slots / INST_RETIRED= .ANY", "MetricGroup": "Pipeline;Ret;Retire", "MetricName": "tma_info_thread_uoppi", "MetricThreshold": "tma_info_thread_uoppi > 1.05", @@ -1956,10 +4137,19 @@ }, { "BriefDescription": "Uops per taken branch", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu_core@BR_= INST_RETIRED.NEAR_TAKEN@", + "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 9", + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", "Unit": "cpu_core" }, { @@ -1968,114 +4158,124 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. 
Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents 128-bit vector Integer= ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction the= CPU has retired", - "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_128@ + cpu_core@INT_V= EC_RETIRED.VNNI_128@) / (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_int_vector_256b, tma_port_0, tma_port_1,= tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents 256-bit vector Integer= ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fraction= the CPU has retired", - "MetricExpr": "(cpu_core@INT_VEC_RETIRED.ADD_256@ + cpu_core@INT_V= EC_RETIRED.MUL_256@ + cpu_core@INT_VEC_RETIRED.VNNI_256@) / (tma_retiring *= tma_info_thread_slots)", + "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_port_0, tma_por= t_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "cpu_core@ICACHE_TAG.STALLS@ / tma_info_thread_clks", + "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", - "MetricExpr": "max((cpu_core@EXE_ACTIVITY.BOUND_ON_LOADS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L1D_MISS@) / tma_info_thread_clks, 0)", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", + "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. 
Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", - "MetricExpr": "min(2 * (cpu_core@MEM_INST_RETIRED.ALL_LOADS@ - cpu= _core@MEM_LOAD_RETIRED.FB_HIT@ - cpu_core@MEM_LOAD_RETIRED.L1_MISS@) * 20 /= 100, max(cpu_core@CYCLE_ACTIVITY.CYCLES_MEM_ANY@ - cpu_core@MEMORY_ACTIVIT= Y.CYCLES_L1D_MISS@, 0)) / tma_info_thread_clks", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", + "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", - "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L1D_MISS@ - cpu_co= re@MEMORY_ACTIVITY.STALLS_L2_MISS@) / tma_info_thread_clks", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L2_HIT * cpu_core@MEM_LOAD_RE= TIRED.L2_HIT@R, MEM_LOAD_RETIRED.L2_HIT * (3 * tma_info_system_core_frequen= cy)) if 0 < cpu_core@MEM_LOAD_RETIRED.L2_HIT@R else MEM_LOAD_RETIRED.L2_HIT= * (3 * tma_info_system_core_frequency)) * (1 + MEM_LOAD_RETIRED.FB_HIT / M= EM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to loads accesses to L3 cache or contended with a sibling Core", - "MetricExpr": "(cpu_core@MEMORY_ACTIVITY.STALLS_L2_MISS@ - cpu_cor= e@MEMORY_ACTIVITY.STALLS_L3_MISS@) / tma_info_thread_clks", + "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. 
Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "cpu_core@MEM_LOAD_RETIRED.L3_HIT@ * min(cpu_core@ME= M_LOAD_RETIRED.L3_HIT@R, 9 * tma_info_system_core_frequency) * (1 + cpu_cor= e@MEM_LOAD_RETIRED.FB_HIT@ / cpu_core@MEM_LOAD_RETIRED.L1_MISS@ / 2) / tma_= info_thread_clks", + "MetricExpr": "(min(MEM_LOAD_RETIRED.L3_HIT * cpu_core@MEM_LOAD_RE= TIRED.L3_HIT@R, MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_freque= ncy) - 3 * tma_info_system_core_frequency) if 0 < cpu_core@MEM_LOAD_RETIRED= .L3_HIT@R else MEM_LOAD_RETIRED.L3_HIT * (12 * tma_info_system_core_frequen= cy) - 3 * tma_info_system_core_frequency) * (1 + MEM_LOAD_RETIRED.FB_HIT / = MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= was stalled due to Length Changing Prefixes (LCPs)", - "MetricExpr": "cpu_core@DECODE.LCP@ / tma_info_thread_clks", + "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Load operations", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_2_3_10@ / (3 * tma_in= fo_core_core_clks)", + "MetricExpr": "UOPS_DISPATCHED.PORT_2_3_10 / (3 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_load_op_utilization", "MetricThreshold": "tma_load_op_utilization > 0.6", @@ -2088,36 +4288,63 @@ "MetricExpr": "max(0, tma_dtlb_load - tma_load_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by load accesses, performing a= hardware page walk", - "MetricExpr": "cpu_core@DTLB_LOAD_MISSES.WALK_ACTIVE@ / tma_info_t= hread_clks", + "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / 
(DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ * cpu_core@ME= M_INST_RETIRED.LOCK_LOADS@R / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricExpr": "MEM_INST_RETIRED.LOCK_LOADS * cpu_core@MEM_INST_RET= IRED.LOCK_LOADS@R / tma_info_thread_clks", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to LSD (Loop Stream Detector) unit", - "MetricExpr": "(cpu_core@LSD.CYCLES_ACTIVE@ - cpu_core@LSD.CYCLES_= OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(LSD.CYCLES_ACTIVE - LSD.CYCLES_OK) / tma_info_core= _core_clks / 2", "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%", "Unit": "cpu_core" }, @@ -2128,54 +4355,54 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_sq_full", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the performance was likely hurt due to latency from external memory - DRA= M ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFF= CORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD@) / tma_info_thread_clks - tm= a_mem_bandwidth", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "MetricExpr": "cpu_core@topdown\\-mem\\-bound@ / (cpu_core@topdown= \\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiri= ng@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "13 * cpu_core@MISC2_RETIRED.LFENCE@ / tma_info_thre= ad_clks", + "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", - "MetricExpr": "tma_light_operations * cpu_core@MEM_UOP_RETIRED.ANY= @ / (tma_retiring * tma_info_thread_slots)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", + "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", "MetricThreshold": "tma_memory_operations > 0.1 & tma_light_operat= ions > 0.6", @@ -2184,27 +4411,27 @@ }, { "BriefDescription": "This metric represents fraction of slots the = CPU was retiring uops fetched by the Microcode Sequencer (MS) unit", - "MetricExpr": "cpu_core@UOPS_RETIRED.MS@ / tma_info_thread_slots", + "MetricExpr": "UOPS_RETIRED.MS / tma_info_thread_slots", "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. 
Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Branch Resteers as a result of Branch Misprediction= at execution stage", - "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * cpu_= core@INT_MISC.CLEAR_RESTEER_CYCLES@ / tma_info_thread_clks", + "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the MITE pipeline (the legacy deco= de pipeline)", - "MetricExpr": "(cpu_core@IDQ.MITE_CYCLES_ANY@ - cpu_core@IDQ.MITE_= CYCLES_OK@) / tma_info_core_core_clks / 2", + "MetricExpr": "(IDQ.MITE_CYCLES_ANY - IDQ.MITE_CYCLES_OK) / tma_in= fo_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", @@ -2213,41 +4440,50 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", - "MetricExpr": "160 * cpu_core@ASSISTS.SSE_AVX_MIX@ / tma_info_thre= ad_clks", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", + "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). 
Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu_core@UOPS_RETIRED.MS\\,c= mask\\=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_cor= e_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ = / (cpu_core@UOPS_RETIRED.SLOTS@ / cpu_core@UOPS_ISSUED.ANY@) / tma_info_thr= ead_clks", + "MetricExpr": "3 * cpu_core@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge= \\=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tm= a_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializ= ing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. 
Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions that were not fused", - "MetricExpr": "tma_light_operations * (cpu_core@BR_INST_RETIRED.AL= L_BRANCHES@ - cpu_core@INST_RETIRED.MACRO_FUSED@) / (tma_retiring * tma_inf= o_thread_slots)", + "MetricExpr": "tma_light_operations * (BR_INST_RETIRED.ALL_BRANCHE= S - INST_RETIRED.MACRO_FUSED) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring NOP (no op) instructions", - "MetricExpr": "tma_light_operations * cpu_core@INST_RETIRED.NOP@ /= (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2263,89 +4499,89 @@ "Unit": "cpu_core" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", - "MetricExpr": "max(tma_branch_mispredicts * (1 - cpu_core@BR_MISP_= RETIRED.ALL_BRANCHES@ / (cpu_core@INT_MISC.CLEARS_COUNT@ - cpu_core@MACHINE= _CLEARS.COUNT@)), 0.0001)", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", + "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", - "MetricExpr": "max(tma_machine_clears * (1 - cpu_core@MACHINE_CLEA= RS.MEMORY_ORDERING@ / cpu_core@MACHINE_CLEARS.COUNT@), 0.0001)", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", + "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of slo= ts the CPU retired uops as a result of handing Page Faults", - "MetricExpr": "99 * cpu_core@ASSISTS.PAGE_FAULT@ / tma_info_thread= _slots", + "MetricExpr": "99 * ASSISTS.PAGE_FAULT / tma_info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. 
Note operating syste= m handling of page faults accounts for the majority of its cost", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd b= ranch)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_0@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_0 / tma_info_core_core_clks", "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_12= 8b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 1 (ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_1@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_1 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_= 0, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simple= ALU)", - "MetricExpr": "cpu_core@UOPS_DISPATCHED.PORT_6@ / tma_info_core_co= re_clks", + "MetricExpr": "UOPS_DISPATCHED.PORT_6 / tma_info_core_core_clks", "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. 
Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", +    "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128= b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_ports_utilized_2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", -    "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (cp= u_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu_core@EXE_ACTIVITY.2_= PORTS_UTIL\\,umask\\=3D0xc@)) / tma_info_thread_clks if cpu_core@ARITH.DIV_= ACTIVE@ < cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_core@EXE_ACTIVITY.BOU= ND_ON_LOADS@ else (cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ + tma_retiring * cpu= _core@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info_thread_clks)", +    "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", -    "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", -    "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", +    "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", +    "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider.
For example; when there are too many multiply operations", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "max((cpu_core@EXE_ACTIVITY.EXE_BOUND_0_PORTS@ + max= (cpu_core@RS.EMPTY_RESOURCE@ - cpu_core@RESOURCE_STALLS.SCOREBOARD@, 0)) / = tma_info_thread_clks, 1) * (cpu_core@CYCLE_ACTIVITY.STALLS_TOTAL@ - cpu_cor= e@EXE_ACTIVITY.BOUND_ON_LOADS@) / tma_info_thread_clks", + "MetricExpr": "max(EXE_ACTIVITY.EXE_BOUND_0_PORTS - RESOURCE_STALL= S.SCOREBOARD, 0) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "cpu_core@EXE_ACTIVITY.1_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%", "Unit": "cpu_core" @@ -2353,28 +4589,37 @@ { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@EXE_ACTIVITY.2_PORTS_UTIL@ / tma_info_thre= ad_clks", + "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma= _int_vector_256b, tma_port_0, tma_port_1, tma_port_6", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@UOPS_EXECUTED.CYCLES_GE_3@ / tma_info_thre= ad_clks", + "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%", "Unit": "cpu_core" }, + { + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to retired misprediction by (indirect) RET instruction= s", + "MetricExpr": "BR_MISP_RETIRED.RET_COST * cpu_core@BR_MISP_RETIRED= .RET_COST@R / tma_info_thread_clks", + "MetricGroup": "BrMispredicts;TopdownL3;tma_L3_group;tma_branch_mi= spredicts_group", + "MetricName": "tma_ret_mispredicts", + "MetricThreshold": "tma_ret_mispredicts > 0.05 & tma_branch_mispre= dicts > 0.1 & tma_bad_speculation > 0.15", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", - "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-= fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@= + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -2385,98 +4630,98 @@ }, { "BriefDescription": "This metric represents fraction of cycles the= CPU issue-pipeline was stalled due to serializing operations", - "MetricExpr": "cpu_core@RESOURCE_STALLS.SCOREBOARD@ / tma_info_thr= ead_clks + tma_c02_wait", + "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring Shuffle operations of 256-bit vector size (FP or Int= eger)", - "MetricExpr": "tma_light_operations * cpu_core@INT_VEC_RETIRED.SHU= FFLES@ / (tma_retiring * tma_info_thread_slots)", + "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers.", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to PAUSE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "cpu_core@CPU_CLK_UNHALTED.PAUSE@ / tma_info_thread_= clks", + "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles hand= ling memory load split accesses - load that cross 64-byte cache line bounda= ry", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@ * min(cpu_co= re@MEM_INST_RETIRED.SPLIT_LOADS@R, tma_info_memory_load_miss_real_latency) = / tma_info_thread_clks", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_LOADS * cpu_core@MEM_IN= ST_RETIRED.SPLIT_LOADS@R, MEM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_lo= ad_miss_real_latency) if 0 < cpu_core@MEM_INST_RETIRED.SPLIT_LOADS@R else M= EM_INST_RETIRED.SPLIT_LOADS * tma_info_memory_load_miss_real_latency) / tma= _info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents rate of split store ac= cesses", - "MetricExpr": "cpu_core@MEM_INST_RETIRED.SPLIT_STORES@ * min(cpu_c= ore@MEM_INST_RETIRED.SPLIT_STORES@R, 1) / tma_info_thread_clks", + "MetricExpr": "(min(MEM_INST_RETIRED.SPLIT_STORES * cpu_core@MEM_I= NST_RETIRED.SPLIT_STORES@R, MEM_INST_RETIRED.SPLIT_STORES) if 0 < cpu_core@= MEM_INST_RETIRED.SPLIT_STORES@R else MEM_INST_RETIRED.SPLIT_STORES) / tma_i= nfo_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . 
Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", - "MetricExpr": "(cpu_core@XQ.FULL_CYCLES@ + cpu_core@L1D_PEND_MISS.= L2_STALLS@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_mem_bandwidth", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates how often CPU was stall= ed due to RFO store memory accesses; RFO store issue a read-for-ownership = request before the write", - "MetricExpr": "cpu_core@EXE_ACTIVITY.BOUND_ON_STORES@ / tma_info_t= hread_clks", + "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. 
Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric roughly estimates fraction of cyc= les when the memory subsystem had loads blocked since they could not forwar= d data from earlier (in program order) overlapping stores", - "MetricExpr": "13 * cpu_core@LD_BLOCKS.STORE_FORWARD@ / tma_info_t= hread_clks", + "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", - "MetricExpr": "(cpu_core@MEM_STORE_RETIRED.L2_HIT@ * 10 * (1 - cpu= _core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.ALL_STORES@)= + (1 - cpu_core@MEM_INST_RETIRED.LOCK_LOADS@ / cpu_core@MEM_INST_RETIRED.A= LL_STORES@) * min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUE= STS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO@)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution port for Store operations", - "MetricExpr": "(cpu_core@UOPS_DISPATCHED.PORT_4_9@ + cpu_core@UOPS= _DISPATCHED.PORT_7_8@) / (4 * tma_info_core_core_clks)", + "MetricExpr": "(UOPS_DISPATCHED.PORT_4_9 + UOPS_DISPATCHED.PORT_7_= 8) / (4 * tma_info_core_core_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_store_op_utilization", "MetricThreshold": "tma_store_op_utilization > 0.6", @@ -2489,46 +4734,73 @@ "MetricExpr": "max(0, tma_dtlb_store - tma_store_stlb_miss)", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%", "Unit": "cpu_core" }, { "BriefDescription": "This metric estimates the fraction of cycles = where the STLB was missed by store accesses, performing a hardware page wal= k", - "MetricExpr": "cpu_core@DTLB_STORE_MISSES.WALK_ACTIVE@ / tma_info_= core_core_clks", + "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%", + "Unit": "cpu_core" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= 
OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_2m",
+        "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "ScaleUnit": "100%",
+        "Unit": "cpu_core"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for data store accesses",
+        "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLETED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)",
+        "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_miss_group",
+        "MetricName": "tma_store_stlb_miss_4k",
+        "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb_miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores",
-        "MetricExpr": "9 * cpu_core@OCR.STREAMING_WR.ANY_RESPONSE@ / tma_info_thread_clks",
+        "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread_clks",
         "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueSmSt;tma_store_bound_group",
         "MetricName": "tma_streaming_stores",
-        "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
+        "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric estimates how often CPU was stalled due to Streaming store memory accesses; Streaming store optimize out a read request required by RFO stores. Even though store accesses do not typically stall out-of-order CPUs; there are few cases where stores can lead to actual stalls. This metric will be flagged should Streaming stores be a bottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma_fb_full",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears",
-        "MetricExpr": "cpu_core@INT_MISC.UNKNOWN_BRANCH_CYCLES@ / tma_info_thread_clks",
+        "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;tma_branch_resteers_group",
         "MetricName": "tma_unknown_branches",
-        "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to new branch address clears. These are fetched branches the Branch Prediction Unit was unable to recognize (e.g. first time the branch is fetched or hitting BTB capacity limit) hence called Unknown Branches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This metric serves as an approximation of legacy x87 usage",
-        "MetricExpr": "tma_retiring * cpu_core@UOPS_EXECUTED.X87@ / cpu_core@UOPS_EXECUTED.THREAD@",
+        "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.THREAD",
         "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group",
         "MetricName": "tma_x87_use",
-        "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint.",
+        "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric serves as an approximation of legacy x87 usage. It accounts for instructions beyond X87 FP arithmetic operations; hence may be used as a thermometer to avoid X87 high usage and preferably upgrade to modern ISA. See Tip under Tuning Hint",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     }
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/other.json b/tools/perf/pmu-events/arch/x86/meteorlake/other.json
index 53d23d8decc6..46a21776a4e9 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/other.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/other.json
@@ -1,4 +1,14 @@
 [
+    {
+        "BriefDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-events. the event also counts for Machine Ordering count.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc1",
+        "EventName": "ASSISTS.HARDWARE",
+        "PublicDescription": "Count all other hardware assists or traps that are not necessarily architecturally exposed (through a software handler) beyond FP; SSE-AVX mix and A/D assists who are counted by dedicated sub-events. This includes, but not limited to, assists at EXE or MEM uop writeback like AVX* load/store/gather/scatter (non-FP GSSE-assist ) , assists generated by ROB like PEBS and RTIT, Uncore trap, RAR (Remote Action Request) and CET (Control flow Enforcement Technology) assists. the event also counts for Machine Ordering count.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x4",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "ASSISTS.PAGE_FAULT",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -18,6 +28,28 @@
         "UMask": "0x1",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x10004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts demand instruction fetches and L1 instruction cache prefetches that were supplied by DRAM.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.DEMAND_CODE_RD.DRAM",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x184000004",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts demand data reads that have any type of response.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -95,6 +127,28 @@
         "UMask": "0x1",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts streaming stores which modify a full 64 byte cacheline that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.FULL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x800000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts streaming stores which modify only part of a 64 byte cacheline that have any type of response.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xB7",
+        "EventName": "OCR.PARTIAL_STREAMING_WR.ANY_RESPONSE",
+        "MSRIndex": "0x1a6,0x1a7",
+        "MSRValue": "0x400000010000",
+        "SampleAfterValue": "100003",
+        "UMask": "0x1",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts streaming stores that have any type of response.",
         "Counter": "0,1,2,3,4,5,6,7",
diff --git a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json b/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json
index bc806c7330f4..265f6c5a0248 100644
--- a/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/meteorlake/pipeline.json
@@ -1,6 +1,6 @@
 [
     {
-        "BriefDescription": "Counts the number of cycles when any of the dividers are active.",
+        "BriefDescription": "Counts the number of cycles when any of the floating point or integer dividers are active.",
         "Counter": "0,1,2,3,4,5,6,7",
         "CounterMask": "1",
         "EventCode": "0xcd",
@@ -54,7 +54,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all branch instructions retired.",
         "SampleAfterValue": "400009",
         "Unit": "cpu_core"
@@ -73,7 +72,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11",
@@ -84,7 +82,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts not taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x10",
@@ -104,7 +101,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x1",
@@ -124,7 +120,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.FAR_BRANCH",
-        "PEBS": "1",
         "PublicDescription": "Counts far branch instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x40",
@@ -144,7 +139,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
         "SampleAfterValue": "100003",
         "UMask": "0x80",
@@ -159,6 +153,15 @@
         "UMask": "0xfb",
         "Unit": "cpu_atom"
     },
+    {
+        "BriefDescription": "Counts the number of near indirect JMP branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.INDIRECT_JMP",
+        "SampleAfterValue": "200003",
+        "UMask": "0xef",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "This event is deprecated. Refer to new event BR_INST_RETIRED.INDIRECT_CALL",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -183,7 +186,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts both direct and indirect near call instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x2",
@@ -203,23 +205,48 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_RETURN",
-        "PEBS": "1",
         "PublicDescription": "Counts return instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x8",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of near taken branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
+        "SampleAfterValue": "200003",
+        "UMask": "0xc0",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Taken branch instructions retired.",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x20",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of near relative CALL branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.REL_CALL",
+        "SampleAfterValue": "200003",
+        "UMask": "0xfd",
+        "Unit": "cpu_atom"
+    },
+    {
+        "BriefDescription": "Counts the number of near relative JMP branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc4",
+        "EventName": "BR_INST_RETIRED.REL_JMP",
+        "SampleAfterValue": "200003",
+        "UMask": "0xdf",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the total number of mispredicted branch instructions retired for all branch types.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -234,7 +261,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.",
         "SampleAfterValue": "400009",
         "Unit": "cpu_core"
@@ -244,7 +270,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.ALL_BRANCHES_COST",
-        "PEBS": "1",
         "SampleAfterValue": "400009",
         "UMask": "0x44",
         "Unit": "cpu_core"
@@ -263,7 +288,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts mispredicted conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11",
@@ -274,7 +298,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_COST",
-        "PEBS": "1",
         "SampleAfterValue": "400009",
         "UMask": "0x51",
         "Unit": "cpu_core"
@@ -284,7 +307,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.",
         "SampleAfterValue": "400009",
         "UMask": "0x10",
@@ -295,7 +317,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_NTAKEN_COST",
-        "PEBS": "1",
         "SampleAfterValue": "400009",
         "UMask": "0x50",
         "Unit": "cpu_core"
@@ -314,7 +335,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x1",
@@ -325,7 +345,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_TAKEN_COST",
-        "PEBS": "1",
         "SampleAfterValue": "400009",
         "UMask": "0x41",
         "Unit": "cpu_core"
@@ -344,7 +363,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts miss-predicted near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
         "SampleAfterValue": "100003",
         "UMask": "0x80",
@@ -364,7 +382,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.",
         "SampleAfterValue": "400009",
         "UMask": "0x2",
@@ -375,7 +392,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT_CALL_COST",
-        "PEBS": "1",
         "SampleAfterValue": "400009",
         "UMask": "0x42",
         "Unit": "cpu_core"
@@ -385,11 +401,19 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT_COST",
-        "PEBS": "1",
         "SampleAfterValue": "100003",
         "UMask": "0xc0",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Counts the number of mispredicted near indirect JMP branch instructions retired.",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xc5",
+        "EventName": "BR_MISP_RETIRED.INDIRECT_JMP",
+        "SampleAfterValue": "200003",
+        "UMask": "0xef",
+        "Unit": "cpu_atom"
+    },
     {
         "BriefDescription": "Counts the number of mispredicted near taken branch instructions retired.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -404,7 +428,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.",
         "SampleAfterValue": "400009",
         "UMask": "0x20",
@@ -415,7 +438,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.NEAR_TAKEN_COST",
-        "PEBS": "1",
         "SampleAfterValue": "400009",
         "UMask": "0x60",
         "Unit": "cpu_core"
@@ -425,7 +447,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.RET",
-        "PEBS": "1",
         "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x8",
@@ -445,7 +466,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.RET_COST",
-        "PEBS": "1",
         "SampleAfterValue": "100007",
         "UMask": "0x48",
         "Unit": "cpu_core"
@@ -771,7 +791,6 @@
         "BriefDescription": "Fixed Counter: Counts the number of instructions retired",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x1",
         "Unit": "cpu_atom"
@@ -780,7 +799,6 @@
         "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1",
@@ -799,7 +817,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.ANY_P",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003",
         "Unit": "cpu_core"
@@ -809,7 +826,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.MACRO_FUSED",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x10",
         "Unit": "cpu_core"
@@ -819,7 +835,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.NOP",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired NOP or ENDBR32/64 or PREFETCHIT0/1 instructions",
         "SampleAfterValue": "2000003",
         "UMask": "0x2",
@@ -829,7 +844,6 @@
         "BriefDescription": "Precise instruction retired with PEBS precise-distribution",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.PREC_DIST",
-        "PEBS": "1",
         "PublicDescription": "A version of INST_RETIRED that allows for a precise distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR++) feature to fix bias in how retired instructions get sampled. Use on Fixed Counter 0.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1",
@@ -840,7 +854,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.REP_ITERATION",
-        "PEBS": "1",
         "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.",
         "SampleAfterValue": "2000003",
         "UMask": "0x8",
@@ -1176,6 +1189,16 @@
         "UMask": "0x20",
         "Unit": "cpu_core"
     },
+    {
+        "BriefDescription": "Cycles stalled due to no store buffers available. (not including draining form sync).",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "EventCode": "0xa2",
+        "EventName": "RESOURCE_STALLS.SB",
+        "PublicDescription": "Counts allocation stall cycles caused by the store buffer (SB) being full. This counts cycles that the pipeline back-end blocked uop delivery from the front-end.",
+        "SampleAfterValue": "100003",
+        "UMask": "0x8",
+        "Unit": "cpu_core"
+    },
     {
         "BriefDescription": "Counts cycles where the pipeline is stalled due to serializing operations.",
         "Counter": "0,1,2,3,4,5,6,7",
-- 
2.48.0.rc2.279.g1de40edade-goog

From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:50 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-19-irogers@google.com>
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 18/23] perf vendor events: Update Rocketlake events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim ,
 Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter ,
 Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang ,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan

Update events from v1.03 to v1.04. Update TMA metrics from 4.8 to 5.01.

Bring in the event updates v1.04:
https://github.com/intel/perfmon/commit/015d5a5eab6850e6367ee4f82e4808e166eaf5a5

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
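For reference, the refreshed tables can be sanity-checked with a stock
perf build once the series is applied. The commands below are
illustrative only (the metric and event names are taken from this
patch; they assume a Rocketlake CPU, and output will vary by system):

  $ perf list --details | grep -i tma_bottleneck
  $ perf stat -M tma_bottleneck_cache_memory_bandwidth -a -- sleep 1
  $ perf stat -e OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD -a -- sleep 1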
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 .../pmu-events/arch/x86/rocketlake/cache.json |  34 +-
 .../arch/x86/rocketlake/frontend.json         |  17 -
 .../arch/x86/rocketlake/memory.json           |  13 +-
 .../arch/x86/rocketlake/metricgroups.json     |  10 +-
 .../arch/x86/rocketlake/pipeline.json         |  30 +-
 .../arch/x86/rocketlake/rkl-metrics.json      | 849 +++++++++---------
 .../x86/rocketlake/uncore-interconnect.json   |  10 -
 .../arch/x86/rocketlake/uncore-other.json     |   2 +-
 .../arch/x86/rocketlake/virtual-memory.json   |  18 +
 10 files changed, 500 insertions(+), 485 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 50304f1056e7..4b21fd6e9b15 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -26,7 +26,7 @@ GenuineIntel-6-BD,v1.10,lunarlake,core
 GenuineIntel-6-(AA|AC|B5),v1.12,meteorlake,core
 GenuineIntel-6-1[AEF],v4,nehalemep,core
 GenuineIntel-6-2E,v4,nehalemex,core
-GenuineIntel-6-A7,v1.03,rocketlake,core
+GenuineIntel-6-A7,v1.04,rocketlake,core
 GenuineIntel-6-2A,v19,sandybridge,core
 GenuineIntel-6-8F,v1.23,sapphirerapids,core
 GenuineIntel-6-AF,v1.04,sierraforest,core
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/cache.json b/tools/perf/pmu-events/arch/x86/rocketlake/cache.json
index 2e93b7835b41..791fa526d192 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/cache.json
@@ -75,11 +75,11 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0xF2",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
@@ -251,7 +251,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.",
         "SampleAfterValue": "1000003",
         "UMask": "0x81"
@@ -262,7 +261,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired store instructions.",
         "SampleAfterValue": "1000003",
         "UMask": "0x82"
@@ -273,7 +271,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired memory instructions - loads and stores.",
         "SampleAfterValue": "1000003",
         "UMask": "0x83"
@@ -284,7 +281,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with locked access.",
         "SampleAfterValue": "100007",
         "UMask": "0x21"
@@ -295,7 +291,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x41"
@@ -306,7 +301,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x42"
@@ -317,7 +311,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x11"
@@ -328,7 +321,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES",
-        "PEBS": "1",
         "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x12"
@@ -339,7 +331,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were L3 and cross-core snoop hits in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x2"
@@ -350,7 +341,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were HitM responses from shared L3.",
         "SampleAfterValue": "20011",
         "UMask": "0x4"
@@ -361,7 +351,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x1"
@@ -372,7 +361,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.",
         "SampleAfterValue": "100003",
         "UMask": "0x8"
@@ -383,7 +371,6 @@
         "Data_LA": "1",
         "EventCode": "0xd4",
         "EventName": "MEM_LOAD_MISC_RETIRED.UC",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access (Bus Lock).",
         "SampleAfterValue": "100007",
         "UMask": "0x4"
@@ -394,7 +381,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.FB_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.",
         "SampleAfterValue": "100007",
         "UMask": "0x40"
@@ -405,7 +391,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.",
         "SampleAfterValue": "1000003",
         "UMask": "0x1"
@@ -416,7 +401,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.",
         "SampleAfterValue": "200003",
         "UMask": "0x8"
@@ -427,7 +411,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.",
         "SampleAfterValue": "200003",
         "UMask": "0x2"
@@ -438,7 +421,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.",
         "SampleAfterValue": "100021",
         "UMask": "0x10"
@@ -449,7 +431,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.",
         "SampleAfterValue": "100021",
         "UMask": "0x4"
@@ -460,7 +441,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.",
         "SampleAfterValue": "50021",
         "UMask": "0x20"
@@ -910,6 +890,16 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x8"
     },
+    {
+        "BriefDescription": "Cycles with outstanding code read requests pending.",
+        "Counter": "0,1,2,3",
+        "CounterMask": "1",
+        "EventCode": "0x60",
+        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD",
+        "PublicDescription": "Cycles with outstanding code read requests pending. Code Read requests include both cacheable and non-cacheable Code Reads. Requests are considered outstanding from the time they miss the core's L2 cache until the transaction completion message is sent to the requestor.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2"
+    },
     {
         "BriefDescription": "Cycles where at least 1 outstanding Demand RFO request is pending.",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/frontend.json b/tools/perf/pmu-events/arch/x86/rocketlake/frontend.json
index e7c7d4d4152d..7afa2436d584 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/frontend.json
@@ -44,7 +44,6 @@
         "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x1",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -56,7 +55,6 @@
         "EventName": "FRONTEND_RETIRED.DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x11",
-        "PEBS": "1",
         "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -68,7 +66,6 @@
         "EventName": "FRONTEND_RETIRED.ITLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x14",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -80,7 +77,6 @@
         "EventName": "FRONTEND_RETIRED.L1I_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x12",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -92,7 +88,6 @@
         "EventName": "FRONTEND_RETIRED.L2_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x13",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -104,7 +99,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500106",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -116,7 +110,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_128",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x508006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -128,7 +121,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_16",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x501006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -140,7 +132,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500206",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -152,7 +143,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_256",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x510006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -164,7 +154,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x100206",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -176,7 +165,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_32",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x502006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -188,7 +176,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_4",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500406",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -200,7 +187,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_512",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x520006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -212,7 +198,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_64",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x504006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -224,7 +209,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_8",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500806",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -236,7 +220,6 @@
         "EventName": "FRONTEND_RETIRED.STLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x15",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json b/tools/perf/pmu-events/arch/x86/rocketlake/memory.json
index f73035f44330..abaf3f4f9d63 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/memory.json
@@ -88,7 +88,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "1009",
         "UMask": "0x1"
@@ -101,7 +100,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "20011",
         "UMask": "0x1"
@@ -114,7 +112,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "503",
         "UMask": "0x1"
@@ -127,7 +124,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -140,7 +136,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100003",
         "UMask": "0x1"
@@ -153,7 +148,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "101",
         "UMask": "0x1"
@@ -166,7 +160,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "2003",
         "UMask": "0x1"
@@ -179,7 +172,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles. Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "50021",
         "UMask": "0x1"
@@ -287,17 +279,16 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of times RTM abort was triggered.",
         "SampleAfterValue": "100003",
         "UMask": "0x4"
     },
     {
-        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)",
+        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED_EVENTS",
-        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).",
+        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt).",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
     },
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json b/tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json
index 3a88260194d1..80ca8021f2de 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/metricgroups.json
@@ -37,6 +37,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -83,7 +84,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -93,6 +96,7 @@
     "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category",
     "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category",
     "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category",
+    "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category",
     "tma_issue2P": "Metrics related by the issue $issue2P",
     "tma_issueBM": "Metrics related by the issue $issueBM",
     "tma_issueBW": "Metrics related by the issue $issueBW",
@@ -112,10 +116,13 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
     "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category",
+    "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category",
     "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
     "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
     "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
@@ -128,5 +135,6 @@
     "tma_retiring_group": "Metrics contributing to tma_retiring category",
     "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category",
     "tma_store_bound_group": "Metrics contributing to tma_store_bound category",
-    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category"
+    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category",
+    "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category"
 }
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json b/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json
index 4fdf07c7beb7..44659f26cbab 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/pipeline.json
@@ -9,6 +9,15 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x9"
     },
+    {
+        "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EventCode": "0x14",
+        "EventName": "ARITH.FP_DIVIDER_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Number of occurrences where a microcode assist is invoked by hardware.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -23,7 +32,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all branch instructions retired.",
         "SampleAfterValue": "400009"
     },
@@ -32,7 +40,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11"
@@ -42,7 +49,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts not taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x10"
@@ -52,7 +58,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x1"
@@ -62,7 +67,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.FAR_BRANCH",
-        "PEBS": "1",
         "PublicDescription": "Counts far branch instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x40"
@@ -72,7 +76,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
@@ -82,7 +85,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts both direct and indirect near call instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x2"
@@ -92,7 +94,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_RETURN",
-        "PEBS": "1",
         "PublicDescription": "Counts return instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x8"
@@ -102,7 +103,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x20"
@@ -112,7 +112,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch. When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.",
         "SampleAfterValue": "50021"
     },
@@ -121,7 +120,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts mispredicted conditional branch instructions retired.",
         "SampleAfterValue": "50021",
         "UMask": "0x11"
@@ -131,7 +129,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.",
         "SampleAfterValue": "50021",
         "UMask": "0x10"
@@ -141,7 +138,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.",
         "SampleAfterValue": "50021",
         "UMask": "0x1"
@@ -151,7 +147,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts all miss-predicted indirect branch instructions retired (excluding RETs. TSX aborts is considered indirect branch).",
         "SampleAfterValue": "50021",
         "UMask": "0x80"
@@ -161,7 +156,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.",
         "SampleAfterValue": "50021",
         "UMask": "0x2"
@@ -171,7 +165,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.",
         "SampleAfterValue": "50021",
         "UMask": "0x20"
@@ -181,7 +174,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.RET",
-        "PEBS": "1",
         "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.",
         "SampleAfterValue": "50021",
         "UMask": "0x8"
@@ -377,7 +369,6 @@
         "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
@@ -387,7 +378,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.ANY_P",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003"
     },
@@ -396,7 +386,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.NOP",
-        "PEBS": "1",
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
     },
@@ -404,7 +393,6 @@
         "BriefDescription": "Precise instruction retired event with a reduced effect of PEBS shadow in IP distribution",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.PREC_DIST",
-        "PEBS": "1",
         "PublicDescription": "A version of INST_RETIRED that allows for a more unbiased distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. Use on Fixed Counter 0.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
index 13474af97786..cfda8956353e 100644
--- a/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/rocketlake/rkl-metrics.json
@@ -89,12 +89,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
@@ -106,7 +106,7 @@
         "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
@@ -129,11 +129,104 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.",
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_= store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_lo= ads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound= / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store= _bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_s= plit_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_b= ound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * = tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stor= es + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_s= haring, tma_machine_clears" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses= + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_ins= truction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_me= mory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memor= y_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_comput= e_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_= overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, + { + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring branch instructions", "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES= / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_branch_instructions", @@ -147,7 +240,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredic= tions, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. 
Related metrics:= tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost= , tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -155,8 +248,8 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -164,8 +257,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. 
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -173,18 +266,66 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= 
_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(29 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + 23.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricExpr": "((32.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (27 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears",
         "ScaleUnit": "100%"
     },
     {
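A note on the cost model refactor in the tma_contested_accesses hunk above and the tma_data_sharing hunk just below: the coefficients are restructured, not retuned. Writing f for tma_info_system_core_frequency,

  32.5 * f - 3.5 * f = 29 * f      (XSNP_HITM term; previously the literal 29)
  27 * f   - 3.5 * f = 23.5 * f    (XSNP_MISS/XSNP_HIT terms; previously 23.5)

so the computed metric values are unchanged. The new form appears to spell each penalty out as a full snoop-hit cost minus a baseline hit cost that is presumably charged elsewhere; the functional changes are limited to the added LockCont metric group and the updated event names in the sample hints.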
@@ -194,25 +335,25 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "23.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(27 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT.
Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -221,7 +362,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -231,8 +372,8 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -241,7 +382,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -249,44 +390,44 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. 
Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "32.5 * tma_info_system_core_frequency * OCR.DEMAND_= RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%"
     },
     {
@@ -296,7 +437,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -306,16 +447,16 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
         "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
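To make the MetricExpr/MetricThreshold pairing concrete: perf evaluates each MetricExpr over the counted events, then applies MetricThreshold to decide whether the metric is worth highlighting. A minimal Python sketch of the tma_few_uops_instructions definition above, with made-up input values; the real evaluator is perf's expression parser under tools/perf/util/, so this is only an illustration:

  # Hypothetical metric values already derived from raw event counts.
  tma_heavy_operations = 0.14
  tma_microcode_sequencer = 0.03

  # MetricExpr: "tma_heavy_operations - tma_microcode_sequencer"
  tma_few_uops_instructions = tma_heavy_operations - tma_microcode_sequencer  # 0.11

  # MetricThreshold: "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1"
  # The flat '&' chains this update favors over nested parentheses act as a
  # plain logical AND over the comparisons.
  flagged = tma_few_uops_instructions > 0.05 and tma_heavy_operations > 0.1
  print(flagged)  # True: 0.11 > 0.05 and 0.14 > 0.1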
@@ -324,7 +465,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -333,7 +474,15 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -341,7 +490,7 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting.
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -350,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -359,8 +508,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -368,8 +517,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -377,7 +526,7 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -389,17 +538,17 @@
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
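The tma_heavy_operations entry below relies on PMU counter-mask semantics, and this update respells cmask=1 as the equivalent hex form cmask=0x1 throughout the file. With cmask=N a counter increments once per cycle in which the event count is at least N, rather than accumulating the raw count, so UOPS_DECODED.DEC0 minus its cmask=1 variant yields decoder-0 uops beyond the first in each active cycle, i.e. the extra uops of multi-uop instructions. A small Python sketch of that identity over made-up per-cycle counts:

  # Hypothetical uops produced by decoder 0 in six consecutive cycles.
  dec0 = [3, 1, 0, 2, 0, 1]

  raw = sum(dec0)                          # UOPS_DECODED.DEC0           -> 7
  cmask1 = sum(1 for u in dec0 if u >= 1)  # UOPS_DECODED.DEC0,cmask=0x1 -> 4

  # Uops beyond the first in each decoding cycle can only come from
  # instructions that decode to more than one uop.
  print(raw - cmask1)  # 3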
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences",
-        "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=1@) / IDQ.MITE_UOPS",
+        "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECODED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=0x1@) / IDQ.MITE_UOPS",
         "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations, instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
         "ScaleUnit": "100%"
     },
     {
@@ -407,41 +556,41 @@
         "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear).
Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -455,32 +604,11 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" }, - { - "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", - "MetricExpr": "tma_info_botlnk_l0_core_bound_likely", - "MetricGroup": "Cor;Metric;SMT", - "MetricName": "tma_info_botlnk_core_bound_likely", - "MetricThreshold": "tma_info_botlnk_core_bound_likely > 0.5" - }, - { - "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd))", - "MetricGroup": 
"DSBmiss;Fed;Scaled_Slots;tma_issueFB", - "MetricName": "tma_info_botlnk_dsb_misses", - "MetricThreshold": "tma_info_botlnk_dsb_misses > 10" - }, - { - "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck.", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", - "MetricGroup": "Fed;FetchLat;IcMiss;Scaled_Slots;tma_issueFL", - "MetricName": "tma_info_botlnk_ic_misses", - "MetricThreshold": "tma_info_botlnk_ic_misses > 5" - }, { "BriefDescription": "Probability of Core Bound bottleneck hidden b= y SMT-profiling artifacts", "MetricConstraint": "NO_GROUP_EVENTS", @@ -491,8 +619,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite)))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -500,7 +628,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -509,108 +637,10 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_= full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_= fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. 
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -671,11 +701,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -688,20 +718,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -737,7 +767,13 @@ "MetricName": "tma_info_frontend_lsd_coverage" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -755,7 +791,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -763,7 +799,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -771,7 +807,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -779,7 +815,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -787,7 +823,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -795,7 +831,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -840,7 +876,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -850,21 +886,9 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 11", + "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, - { - "BriefDescription": "\"Bus lock\" per kilo instruction", - "MetricExpr": "tma_info_memory_mix_bus_lock_pki", - "MetricGroup": "Mem;Metric", - "MetricName": "tma_info_memory_bus_lock_pki" - }, - { - "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", - "MetricExpr": "tma_info_memory_tlb_code_stlb_mpki", - "MetricGroup": "Fed;MemoryTLB;Metric", - "MetricName": "tma_info_memory_code_stlb_mpki" - }, { "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", @@ -889,12 +913,6 @@ "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_core_l3_cache_fill_bw_2t" }, - { - "BriefDescription": "Average Parallel L2 cache miss data reads", - "MetricExpr": "tma_info_memory_latency_data_l2_mlp", - "MetricGroup": "Memory_BW;Metric;Offcore", - "MetricName": "tma_info_memory_data_l2_mlp" - }, { "BriefDescription": "Fill Buffer (FB) hits per kilo instructions f= or retired demand loads (L1D misses that merge into ongoing miss-handling e= ntries)", "MetricExpr": "1e3 * MEM_LOAD_RETIRED.FB_HIT / INST_RETIRED.ANY", @@ -903,16 +921,10 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 1 data cache [GB / sec]", - "MetricExpr": "tma_info_memory_l1d_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l1d_cache_fill_bw_2t" - }, { "BriefDescription": "L1 cache true misses per kilo instruction for= retired demand loads", "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L1_MISS / INST_RETIRED.ANY", @@ -927,16 +939,10 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 2 cache [GB / sec]", - "MetricExpr": 
"tma_info_memory_l2_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l2_cache_fill_bw_2t" - }, { "BriefDescription": "L2 cache hits per kilo instruction for all re= quest types (including speculative)", "MetricExpr": "1e3 * (L2_RQSTS.REFERENCES - L2_RQSTS.MISS) / INST_= RETIRED.ANY", @@ -975,28 +981,16 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, - { - "BriefDescription": "Average per-core data access bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "tma_info_memory_l3_cache_access_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW;Offcore", - "MetricName": "tma_info_memory_l3_cache_access_bw_2t" - }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, - { - "BriefDescription": "Average per-core data fill bandwidth to the L= 3 cache [GB / sec]", - "MetricExpr": "tma_info_memory_l3_cache_fill_bw", - "MetricGroup": "Core_Metric;Mem;MemoryBW", - "MetricName": "tma_info_memory_l3_cache_fill_bw_2t" - }, { "BriefDescription": "L3 cache true misses per kilo instruction for= retired demand loads", "MetricExpr": "1e3 * MEM_LOAD_RETIRED.L3_MISS / INST_RETIRED.ANY", @@ -1011,52 +1005,22 @@ }, { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "tma_info_memory_load_l2_miss_latency", - "MetricGroup": "Memory_Lat;Offcore", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", - "MetricName": "tma_info_memory_latency_load_l3_miss_latency" - }, - { - "BriefDescription": "Average Latency for L2 cache miss demand Load= s", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", - "MetricName": "tma_info_memory_load_l2_miss_latency" - }, - { - "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", - "MetricGroup": "Memory_BW;Metric;Offcore", - "MetricName": "tma_info_memory_load_l2_mlp" - }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": 
"cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x0@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Clocks_Latency;Memory_Lat;Offcore", - "MetricName": "tma_info_memory_load_l3_miss_latency" - }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", "MetricGroup": "Mem;MemoryBound;MemoryLat", "MetricName": "tma_info_memory_load_miss_real_latency" }, - { - "BriefDescription": "STLB (2nd level TLB) data load speculative mi= sses per kilo instruction (misses of any page-size that complete the page w= alk)", - "MetricExpr": "tma_info_memory_tlb_load_stlb_mpki", - "MetricGroup": "Mem;MemoryTLB;Metric", - "MetricName": "tma_info_memory_load_stlb_mpki" - }, { "BriefDescription": "\"Bus lock\" per kilo instruction", "MetricExpr": "1e3 * SQ_MISC.BUS_LOCK / INST_RETIRED.ANY", @@ -1065,7 +1029,7 @@ }, { "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "tma_info_memory_uc_load_pki", + "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", "MetricGroup": "Mem", "MetricName": "tma_info_memory_mix_uc_load_pki" }, @@ -1077,17 +1041,11 @@ "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, { - "BriefDescription": "Utilization of the core's Page Walker(s) serv= ing STLB misses triggered by instruction/Load/Store accesses", - "MetricExpr": "tma_info_memory_tlb_page_walks_utilization", - "MetricGroup": "Core_Metric;Mem;MemoryTLB", - "MetricName": "tma_info_memory_page_walks_utilization", - "MetricThreshold": "tma_info_memory_page_walks_utilization > 0.5" - }, - { - "BriefDescription": "STLB (2nd level TLB) data store speculative m= isses per kilo instruction (misses of any page-size that complete the page = walk)", - "MetricExpr": "tma_info_memory_tlb_store_stlb_mpki", - "MetricGroup": "Mem;MemoryTLB;Metric", - "MetricName": "tma_info_memory_store_stlb_mpki" + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1114,15 +1072,9 @@ "MetricGroup": "Mem;MemoryTLB", "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, - { - "BriefDescription": "Un-cacheable retired load per kilo instructio= n", - "MetricExpr": "1e3 * MEM_LOAD_MISC_RETIRED.UC / INST_RETIRED.ANY", - "MetricGroup": "Mem;Metric", - "MetricName": "tma_info_memory_uc_load_pki" - }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1149,18 +1101,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": 
"MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1178,14 +1130,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -1195,13 +1147,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1224,12 +1177,25 @@ "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1237,7 +1203,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1245,7 +1211,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." 
+ "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1259,6 +1225,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1266,7 +1239,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1275,14 +1248,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1292,13 +1266,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1314,33 +1288,41 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 7.5" + "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. 
Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1349,8 +1331,17 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1359,17 +1350,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "9 * tma_info_system_core_frequency * (MEM_LOAD_RETI= RED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) = / tma_info_thread_clks", + "MetricExpr": "(12.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1377,18 +1368,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1405,7 +1396,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1413,16 +1404,40 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + 
"MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1432,7 +1447,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%" }, { @@ -1442,16 +1457,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1459,8 +1474,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1470,11 +1485,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1488,7 +1503,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1496,8 +1511,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1511,19 +1526,27 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { @@ -1531,8 +1554,8 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1540,7 +1563,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1555,19 +1578,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1603,7 +1626,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1611,8 +1634,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1620,8 +1643,8 @@ "MetricExpr": "cpu@EXE_ACTIVITY.3_PORTS_UTIL\\,umask\\=3D0x80@ / t= ma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1629,7 +1652,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1638,7 +1661,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1647,14 +1670,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1667,7 +1690,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1676,7 +1699,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1685,8 +1708,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1695,17 +1718,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1713,8 +1736,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1723,17 +1746,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1750,7 +1773,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1758,7 +1781,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= 
ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1766,7 +1813,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1775,7 +1822,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1784,8 +1831,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.= json b/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.json index 3946d4e01a8c..a0057f8d815e 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/uncore-interconnect.json @@ -38,16 +38,6 @@ "UMask": "0x2", "Unit": "ARB" }, - { - "BriefDescription": "Number of all coherent Data Read entries. Doe= sn't include prefetches", - "Counter": "1", - "EventCode": "0x81", - "EventName": "UNC_ARB_REQ_TRK_REQUEST.DRD", - "Experimental": "1", - "PerPkg": "1", - "UMask": "0x2", - "Unit": "ARB" - }, { "BriefDescription": "Each cycle counts number of all outgoing vali= d entries in ReqTrk. Such entry is defined as valid from its allocation in = ReqTrk until deallocation. Accounts for Coherent and non-coherent traffic.", "Counter": "0", diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json b/= tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json index cc8110ac020c..1ac5b5ef8094 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/uncore-other.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_CLOCK.SOCKET", + "BriefDescription": "This 48-bit fixed counter counts the UCLK cyc= les.", "Counter": "FIXED", "EventCode": "0xff", "EventName": "UNC_CLOCK.SOCKET", diff --git a/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json = b/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json index 3ff51040f84f..9df790d4361f 100644 --- a/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/rocketlake/virtual-memory.json @@ -27,6 +27,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", "SampleAfterValue": "100003", "UMask": "0x8" }, { "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", "Counter": "0,1,2,3", @@ -82,6 +91,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data stores. This implies address translations missed in the D= TLB and further levels of TLB. The page walk can end with or without a faul= t.", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", "Counter": "0,1,2,3", --=20 2.48.0.rc2.279.g1de40edade-goog
From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:51 -0800 In-Reply-To: <20250116064355.344245-1-irogers@google.com> Message-Id: <20250116064355.344245-20-irogers@google.com> Mime-Version: 1.0 References: <20250116064355.344245-1-irogers@google.com> X-Mailer: git-send-email 2.48.0.rc2.279.g1de40edade-goog Subject: [PATCH v2 19/23] perf vendor events: Update Sapphirerapids events/metrics From: Ian Rogers To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , "=?UTF-8?q?Andreas=20F=C3=A4rber?=" , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"
Update events from v1.23 to v1.25. Update TMA metrics from 4.8 to 5.01.
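As a quick illustrative check (a sketch only, not part of the patch; the
chosen metric and workload are arbitrary examples), the refreshed metrics
can be exercised once perf is rebuilt so the updated JSON is compiled in:

  $ make -C tools/perf            # regenerates pmu-events from the JSON
  $ perf stat -M tma_memory_bound -a -- sleep 1

Note the new TopdownL6 children added by TMA 5.01 (e.g.
tma_load_stlb_miss_4k) scale the parent STLB-miss walk cycles by the
ratio of completed walks of that page size, so by construction the
4K/2M_4M/1G children sum to the parent tma_load_stlb_miss.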
Bring in the event updates v1.25: https://github.com/intel/perfmon/commit/78d6273c546329052429e3a005491b58fbe= 1167b https://github.com/intel/perfmon/commit/f069ed9d0b69b02d76d4b4c59dfc75b62bf= b2254 The TMA 5.01 addition is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c2= 2d782 Update uncore IIO events umask with the change: https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667= a2174 which should address an issue originally raised by Michael Petlan: Reported-by: Michael Petlan Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Die= go/ Co-developed-by: Caleb Biggers Signed-off-by: Caleb Biggers Signed-off-by: Ian Rogers Acked-by: Kan Liang --- tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +- .../arch/x86/sapphirerapids/cache.json | 30 +- .../arch/x86/sapphirerapids/frontend.json | 19 - .../arch/x86/sapphirerapids/memory.json | 15 +- .../arch/x86/sapphirerapids/metricgroups.json | 10 +- .../arch/x86/sapphirerapids/pipeline.json | 23 - .../arch/x86/sapphirerapids/spr-metrics.json | 968 ++++++++++-------- .../arch/x86/sapphirerapids/uncore-io.json | 138 ++- 8 files changed, 631 insertions(+), 574 deletions(-) diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 4b21fd6e9b15..9e37d38036a4 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -28,7 +28,7 @@ GenuineIntel-6-1[AEF],v4,nehalemep,core GenuineIntel-6-2E,v4,nehalemex,core GenuineIntel-6-A7,v1.04,rocketlake,core GenuineIntel-6-2A,v19,sandybridge,core -GenuineIntel-6-8F,v1.23,sapphirerapids,core +GenuineIntel-6-8F,v1.25,sapphirerapids,core GenuineIntel-6-AF,v1.04,sierraforest,core GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json b/too= ls/perf/pmu-events/arch/x86/sapphirerapids/cache.json index eec7bf6ebd53..e35dbb7c2ccd 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/cache.json @@ -92,11 +92,11 @@ "UMask": "0x2" }, { - "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache when triggered by an L2 cache fill.", + "BriefDescription": "Non-modified cache lines that are silently dr= opped by L2 cache.", "Counter": "0,1,2,3", "EventCode": "0x26", "EventName": "L2_LINES_OUT.SILENT", - "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache when triggered by an L2 cache fill. These lines are ty= pically in Shared or Exclusive state. A non-threaded event.", + "PublicDescription": "Counts the number of lines that are silently= dropped by L2 cache. These lines are typically in Shared or Exclusive stat= e. A non-threaded event.", "SampleAfterValue": "200003", "UMask": "0x1" }, @@ -311,7 +311,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_LOADS", - "PEBS": "1", "PublicDescription": "Counts all retired load instructions. 
This e= vent accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2= or PREFETCHW.", "SampleAfterValue": "1000003", "UMask": "0x81" @@ -322,7 +321,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ALL_STORES", - "PEBS": "1", "PublicDescription": "Counts all retired store instructions.", "SampleAfterValue": "1000003", "UMask": "0x82" @@ -333,7 +331,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts all retired memory instructions - loa= ds and stores.", "SampleAfterValue": "1000003", "UMask": "0x83" @@ -344,7 +341,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.LOCK_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with locked= access.", "SampleAfterValue": "100007", "UMask": "0x21" @@ -355,7 +351,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_LOADS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions that split = across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x41" @@ -366,7 +361,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.SPLIT_STORES", - "PEBS": "1", "PublicDescription": "Counts retired store instructions that split= across a cacheline boundary.", "SampleAfterValue": "100003", "UMask": "0x42" @@ -377,7 +371,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS", - "PEBS": "1", "PublicDescription": "Number of retired load instructions that (st= art a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x11" @@ -388,7 +381,6 @@ "Data_LA": "1", "EventCode": "0xd0", "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES", - "PEBS": "1", "PublicDescription": "Number of retired store instructions that (s= tart a) miss in the 2nd-level TLB (STLB).", "SampleAfterValue": "100003", "UMask": "0x12" @@ -408,7 +400,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were HitM responses from shared L3.", "SampleAfterValue": "20011", "UMask": "0x4" @@ -419,7 +410,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS", - "PEBS": "1", "PublicDescription": "Counts the retired load instructions whose d= ata sources were L3 hit and cross-core snoop missed in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -430,7 +420,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were hits in L3 without snoops required.", "SampleAfterValue": "100003", "UMask": "0x8" @@ -441,7 +430,6 @@ "Data_LA": "1", "EventCode": "0xd2", "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD", - "PEBS": "1", "PublicDescription": "Counts retired load instructions whose data = sources were L3 and cross-core snoop hits in on-pkg core cache.", "SampleAfterValue": "20011", "UMask": "0x2" @@ -452,7 +440,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM", - "PEBS": "1", "PublicDescription": "Retired load instructions which data sources= missed L3 but serviced from local DRAM.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -463,7 +450,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM", - "PEBS": "1", "SampleAfterValue": "1000003", 
"UMask": "0x2" }, @@ -473,7 +459,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD", - "PEBS": "1", "PublicDescription": "Retired load instructions whose data sources= was forwarded from a remote cache.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -484,7 +469,6 @@ "Data_LA": "1", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM", - "PEBS": "1", "SampleAfterValue": "1000003", "UMask": "0x4" }, @@ -493,7 +477,6 @@ "Counter": "0,1,2,3", "EventCode": "0xd3", "EventName": "MEM_LOAD_L3_MISS_RETIRED.REMOTE_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with remote= Intel(R) Optane(TM) DC persistent memory as the data source and the data r= equest missed L3.", "SampleAfterValue": "100007", "UMask": "0x10" @@ -504,7 +487,6 @@ "Data_LA": "1", "EventCode": "0xd4", "EventName": "MEM_LOAD_MISC_RETIRED.UC", - "PEBS": "1", "PublicDescription": "Retired instructions with at least one load = to uncacheable memory-type, or at least one cache-line split locked access = (Bus Lock).", "SampleAfterValue": "100007", "UMask": "0x4" @@ -515,7 +497,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.FB_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding= miss to the same cache line with data not ready.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -526,7 +507,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L1 data cache. This event includes all SW prefet= ches and lock instructions regardless of the data source.", "SampleAfterValue": "1000003", "UMask": "0x1" @@ -537,7 +517,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L1_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L1 cache.", "SampleAfterValue": "200003", "UMask": "0x8" @@ -548,7 +527,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with L2 cac= he hits as data sources.", "SampleAfterValue": "200003", "UMask": "0x2" @@ -559,7 +537,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L2_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions missed L2 c= ache as data sources.", "SampleAfterValue": "100021", "UMask": "0x10" @@ -570,7 +547,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_HIT", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that hit in the L3 cache.", "SampleAfterValue": "100021", "UMask": "0x4" @@ -581,7 +557,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.L3_MISS", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with at lea= st one uop that missed in the L3 cache.", "SampleAfterValue": "50021", "UMask": "0x20" @@ -592,7 +567,6 @@ "Data_LA": "1", "EventCode": "0xd1", "EventName": "MEM_LOAD_RETIRED.LOCAL_PMM", - "PEBS": "1", "PublicDescription": "Counts retired load instructions with local = Intel(R) Optane(TM) DC persistent memory as the data source and the data re= quest missed L3.", "SampleAfterValue": "1000003", "UMask": "0x80" diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json b/= 
index f6e3e40a3b20..bf68493d4509 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/frontend.json @@ -41,7 +41,6 @@ "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x1", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -53,7 +52,6 @@ "EventName": "FRONTEND_RETIRED.DSB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x11", - "PEBS": "1", "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -65,7 +63,6 @@ "EventName": "FRONTEND_RETIRED.ITLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x14", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -77,7 +74,6 @@ "EventName": "FRONTEND_RETIRED.L1I_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x12", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -89,7 +85,6 @@ "EventName": "FRONTEND_RETIRED.L2_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x13", - "PEBS": "1", "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -101,7 +96,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x600106", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -113,7 +107,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_128", "MSRIndex": "0x3F7", "MSRValue": "0x608006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -125,7 +118,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_16", "MSRIndex": "0x3F7", "MSRValue": "0x601006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles.
During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -137,7 +129,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2", "MSRIndex": "0x3F7", "MSRValue": "0x600206", - "PEBS": "1", "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -149,7 +140,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_256", "MSRIndex": "0x3F7", "MSRValue": "0x610006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -161,7 +151,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1", "MSRIndex": "0x3F7", "MSRValue": "0x100206", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -173,7 +162,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_32", "MSRIndex": "0x3F7", "MSRValue": "0x602006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -185,7 +173,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_4", "MSRIndex": "0x3F7", "MSRValue": "0x600406", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -197,7 +184,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_512", "MSRIndex": "0x3F7", "MSRValue": "0x620006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -209,7 +195,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_64", "MSRIndex": "0x3F7", "MSRValue": "0x604006", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -221,7 +206,6 @@ "EventName": "FRONTEND_RETIRED.LATENCY_GE_8", "MSRIndex": "0x3F7", "MSRValue": "0x600806", - "PEBS": "1", "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles.
During this period the front-end delivered no uops.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -233,7 +217,6 @@ "EventName": "FRONTEND_RETIRED.MS_FLOWS", "MSRIndex": "0x3F7", "MSRValue": "0x8", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x1" }, @@ -244,7 +227,6 @@ "EventName": "FRONTEND_RETIRED.STLB_MISS", "MSRIndex": "0x3F7", "MSRValue": "0x15", - "PEBS": "1", "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -256,7 +238,6 @@ "EventName": "FRONTEND_RETIRED.UNKNOWN_BRANCH", "MSRIndex": "0x3F7", "MSRValue": "0x17", - "PEBS": "1", "SampleAfterValue": "100007", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json index 2ea19539291b..41d4120d4dae 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/memory.json @@ -63,7 +63,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_1024", "MSRIndex": "0x3F6", "MSRValue": "0x400", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 1024 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "53", "UMask": "0x1" @@ -76,7 +75,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128", "MSRIndex": "0x3F6", "MSRValue": "0x80", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "1009", "UMask": "0x1" @@ -89,7 +87,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16", "MSRIndex": "0x3F6", "MSRValue": "0x10", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "20011", "UMask": "0x1" @@ -102,7 +99,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256", "MSRIndex": "0x3F6", "MSRValue": "0x100", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "503", "UMask": "0x1" @@ -115,7 +111,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", "MSRIndex": "0x3F6", "MSRValue": "0x20", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100007", "UMask": "0x1" @@ -128,7 +123,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4", "MSRIndex": "0x3F6", "MSRValue": "0x4", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "100003", "UMask": "0x1" @@ -141,7 +135,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512", "MSRIndex": "0x3F6", "MSRValue": "0x200", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.
Reported latency may be longer than just the memory latency.", "SampleAfterValue": "101", "UMask": "0x1" @@ -154,7 +147,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", "MSRIndex": "0x3F6", "MSRValue": "0x40", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "2003", "UMask": "0x1" @@ -167,7 +159,6 @@ "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8", "MSRIndex": "0x3F6", "MSRValue": "0x8", - "PEBS": "2", "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.  Reported latency may be longer than just the memory latency.", "SampleAfterValue": "50021", "UMask": "0x1" @@ -178,7 +169,6 @@ "Data_LA": "1", "EventCode": "0xcd", "EventName": "MEM_TRANS_RETIRED.STORE_SAMPLE", - "PEBS": "2", "PublicDescription": "Counts Retired memory accesses with at least 1 store operation. This PEBS event is the precisely-distributed (PDist) trigger covering all stores uops for sampling by the PEBS Store Latency Facility. The facility is described in Intel SDM Volume 3 section 19.9.8", "SampleAfterValue": "1000003", "UMask": "0x2" @@ -305,17 +295,16 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED", - "PEBS": "1", "PublicDescription": "Counts the number of times RTM abort was triggered.", "SampleAfterValue": "100003", "UMask": "0x4" }, { - "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)", + "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)", "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc9", "EventName": "RTM_RETIRED.ABORTED_EVENTS", - "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).", + "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g.
interrupt).", "SampleAfterValue": "100003", "UMask": "0x80" }, diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.jso= n b/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.json index e1de6c2675c4..9129fb7b7ce4 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/metricgroups.json @@ -40,6 +40,7 @@ "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -86,7 +87,9 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mi= spredicts category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", + "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_mis= s category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -96,6 +99,7 @@ "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category= ", "tma_frontend_bound_group": "Metrics contributing to tma_frontend_boun= d category", "tma_heavy_operations_group": "Metrics contributing to tma_heavy_opera= tions category", + "tma_icache_misses_group": "Metrics contributing to tma_icache_misses = category", "tma_int_operations_group": "Metrics contributing to tma_int_operation= s category", "tma_issue2P": "Metrics related by the issue $issue2P", "tma_issueBM": "Metrics related by the issue $issueBM", @@ -116,10 +120,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_bandwidth_group": "Metrics contributing to tma_mem_bandwidth = category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", @@ -133,5 +140,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics 
contributing to tma_store_bound category", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category" } diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json index 5d5811f26151..50cacfbbc7cf 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/pipeline.json @@ -62,7 +62,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all branch instructions retired.", "SampleAfterValue": "400009" }, @@ -71,7 +70,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -81,7 +79,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts not taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x10" @@ -91,7 +88,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -101,7 +97,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.FAR_BRANCH", - "PEBS": "1", "PublicDescription": "Counts far branch instructions retired.", "SampleAfterValue": "100007", "UMask": "0x40" @@ -111,7 +106,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -121,7 +115,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_CALL", - "PEBS": "1", "PublicDescription": "Counts both direct and indirect near call instructions retired.", "SampleAfterValue": "100007", "UMask": "0x2" @@ -131,7 +124,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_RETURN", - "PEBS": "1", "PublicDescription": "Counts return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -141,7 +133,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc4", "EventName": "BR_INST_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -151,7 +142,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.ALL_BRANCHES", - "PEBS": "1", "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch.
When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.", "SampleAfterValue": "400009" }, @@ -160,7 +150,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND", - "PEBS": "1", "PublicDescription": "Counts mispredicted conditional branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x11" @@ -170,7 +159,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_NTAKEN", - "PEBS": "1", "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.", "SampleAfterValue": "400009", "UMask": "0x10" @@ -180,7 +168,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.COND_TAKEN", - "PEBS": "1", "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.", "SampleAfterValue": "400009", "UMask": "0x1" @@ -190,7 +177,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT", - "PEBS": "1", "PublicDescription": "Counts miss-predicted near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.", "SampleAfterValue": "100003", "UMask": "0x80" @@ -200,7 +186,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.INDIRECT_CALL", - "PEBS": "1", "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.", "SampleAfterValue": "400009", "UMask": "0x2" @@ -210,7 +195,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.NEAR_TAKEN", - "PEBS": "1", "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.", "SampleAfterValue": "400009", "UMask": "0x20" @@ -220,7 +204,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc5", "EventName": "BR_MISP_RETIRED.RET", - "PEBS": "1", "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.", "SampleAfterValue": "100007", "UMask": "0x8" @@ -469,7 +452,6 @@ "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -479,7 +461,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.ANY_P", - "PEBS": "1", "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events.
INST_RETIRED.ANY_P is counted by a programmable counter.", "SampleAfterValue": "2000003" }, @@ -488,7 +469,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.MACRO_FUSED", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x10" }, { "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.NOP", - "PEBS": "1", "PublicDescription": "Counts all retired NOP or ENDBR32/64 instructions", "SampleAfterValue": "2000003", "UMask": "0x2" @@ -506,7 +485,6 @@ "BriefDescription": "Precise instruction retired with PEBS precise-distribution", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.PREC_DIST", - "PEBS": "1", "PublicDescription": "A version of INST_RETIRED that allows for a precise distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR++) feature to fix bias in how retired instructions get sampled. Use on Fixed Counter 0.", "SampleAfterValue": "2000003", "UMask": "0x1" @@ -516,7 +494,6 @@ "Counter": "0,1,2,3,4,5,6,7", "EventCode": "0xc0", "EventName": "INST_RETIRED.REP_ITERATION", - "PEBS": "1", "PublicDescription": "Number of iterations of Repeat (REP) string retired instructions such as MOVS, CMPS, and SCAS. Each has a byte, word, and doubleword version and string instructions can be repeated using a repetition prefix, REP, that allows their architectural execution to be repeated a number of times as specified by the RCX register. Note the number of iterations is implementation-dependent.", "SampleAfterValue": "2000003", "UMask": "0x8" diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json index 2b3b013ccb06..37584a8a75aa 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json @@ -34,7 +34,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating how much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -55,85 +55,85 @@ "BriefDescription": "Ratio of number of completed page walks (for 2 megabyte page sizes) caused by demand data loads to the total number of completed instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", -
This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO reads that are initiated by end device controlle= rs that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO reads that are initiated by end device controlle= rs that are requesting memory from the CPU", "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e= 6 / duration_time", "MetricName": "iio_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO writes that are initiated by end device controll= ers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO writes that are initiated by end device controll= ers that are writing memory to the CPU", "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1= e6 / duration_time", "MetricName": "iio_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / durati= on_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket.= ", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / = duration_time", "MetricName": "io_bandwidth_read_local", 
"ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 /= duration_time", "MetricName": "io_bandwidth_read_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_IN= SERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_I= NSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Percentage of inbound full cacheline writes i= nitiated by end device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound full cacheline writes i= nitiated by end device controllers that miss the L3 cache", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM / UNC_CHA_TOR_INSE= RTS.IO_ITOM", "MetricName": "io_percent_of_inbound_full_writes_that_miss_l3", "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of inbound partial cacheline write= s initiated by end device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound partial cacheline write= s initiated by end device controllers that miss the L3 cache", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_MISS_ITOMCACHENEAR + UNC_CH= A_TOR_INSERTS.IO_MISS_RFO) / (UNC_CHA_TOR_INSERTS.IO_ITOMCACHENEAR + UNC_CH= A_TOR_INSERTS.IO_RFO)", "MetricName": "io_percent_of_inbound_partial_writes_that_miss_l3", "ScaleUnit": "100%" }, { - "BriefDescription": "Percentage of inbound reads initiated by end = device controllers that miss the L3 cache.", + "BriefDescription": "Percentage of inbound reads initiated by end = device controllers that miss the L3 cache", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_MISS_PCIRDCUR / UNC_CHA_TOR_= INSERTS.IO_PCIRDCUR", "MetricName": "io_percent_of_inbound_reads_that_miss_l3", "ScaleUnit": "100%" @@ -142,14 +142,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_2nd_level_large_page_mpi", - 
"PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -237,25 +237,25 @@ "ScaleUnit": "1ns" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", "MetricName": "llc_miss_local_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_local_memory_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_remote_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", "MetricName": "llc_miss_remote_memory_bandwidth_write", "ScaleUnit": "1MB/s" @@ -285,19 +285,19 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= ).", + "BriefDescription": "Memory write bandwidth (MB/sec) caused by dir= ectory updates; includes DDR and Intel(R) Optane(TM) Persistent Memory(PMEM= )", "MetricExpr": "(UNC_CHA_DIR_UPDATE.HA + UNC_CHA_DIR_UPDATE.TOR + U= NC_M2M_DIRECTORY_UPDATE.ANY) * 64 / 1e6 / duration_time", "MetricName": 
"memory_extra_write_bw_due_to_directory_updates", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCAL = + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_REMOTE + UNC_CHA_T= OR_INSERTS.IA_MISS_DRD_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_DRD_LOCA= L + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_LOCAL + UNC_CHA_TOR_INSERTS.IA_MIS= S_DRD_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_PREF_REMOTE)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -360,7 +360,7 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + = UOPS_DISPATCHED.PORT_5_11 + UOPS_DISPATCHED.PORT_6) / (5 * tma_info_core_co= re_clks)", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -372,7 +372,7 @@ "MetricExpr": "EXE.AMX_BUSY / tma_info_core_core_clks", "MetricGroup": "BvCB;Compute;HPC;Server;TopdownL3;tma_L3_group;tma= _core_bound_group", "MetricName": "tma_amx_busy", - "MetricThreshold": "tma_amx_busy > 0.5 & (tma_core_bound > 0.1 & t= ma_backend_bound > 0.2)", + "MetricThreshold": "tma_amx_busy > 0.5 & tma_core_bound > 0.1 & tm= a_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -380,12 +380,12 @@ "MetricExpr": "78 * ASSISTS.ANY / tma_info_thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. 
Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists.", + "BriefDescription": "This metric estimates fraction of slots the CPU retired uops as a result of handing SSE to AVX* or AVX* to SSE transition Assists", "MetricExpr": "63 * ASSISTS.SSE_AVX_MIX / tma_info_thread_slots", "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_avx_assists", @@ -394,35 +394,125 @@ }, { "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots", - "MetricGroup": "BvOB;Default;TmaL1;TopdownL1;tma_L1_group", + "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvOB;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", - "MetricgroupNoGroup": "TopdownL1;Default", + "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations", - "DefaultMetricgroupName": "TopdownL1", "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)", - "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group", + "MetricGroup": "TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.", + "MetricgroupNoGroup": "TopdownL1", + "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation.
For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example", "ScaleUnit": "100%" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks.
Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", + "MetricName": "tma_bottleneck_cache_memory_latency", + "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20", + "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency" + }, + { + "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation", + "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * tma_amx_busy / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", + "MetricGroup": "BvCB;Cor;tma_issueComp", + "MetricName": "tma_bottleneck_compute_bound_est", + "MetricThreshold": "tma_bottleneck_compute_bound_est > 20", + "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation.
Covers Core Bound when High ILP as well as when long-latency execution units are busy" + }, + { + "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)", + "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - (1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms))) - tma_bottleneck_big_code", + "MetricGroup": "BvFB;Fed;FetchBW;Frontend", + "MetricName": "tma_bottleneck_instruction_fetch_bw", + "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20" + }, + { + "BriefDescription": "Total pipeline cost of irregular execution (e.g", + "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=0x1@) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + RS.EMPTY_RESOURCE / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_amx_busy + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", + "MetricName": "tma_bottleneck_irregular_overhead", + "MetricThreshold": "tma_bottleneck_irregular_overhead > 10", + "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments).
Related metrics: tma_microcode_sequencer, tma_ms_switches" + }, + { + "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)", + "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))", + "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", + "MetricName": "tma_bottleneck_memory_data_tlbs", + "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20", + "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store" + }, + { + "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))", + "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn", + "MetricName": "tma_bottleneck_memory_synchronization", + "MetricThreshold": "tma_bottleneck_memory_synchronization > 10", + "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache" + }, + { + "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks", + "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))", + "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", + "MetricName": "tma_bottleneck_mispredictions", + "MetricThreshold": "tma_bottleneck_mispredictions > 20", + "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks.
Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers" + }, + { + "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end", + "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)", + "MetricGroup": "BvOB;Cor;Offcore", + "MetricName": "tma_bottleneck_other_bottlenecks", + "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20", + "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls" + }, + { + "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead", + "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", + "MetricGroup": "BvUW;Ret", + "MetricName": "tma_bottleneck_useful_work", + "MetricThreshold": "tma_bottleneck_useful_work > 20" + }, { "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots", - "MetricGroup": "BadSpec;BrMispredicts;BvMP;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", + "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BadSpec;BrMispredicts;BvMP;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM", "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS.
Related metrics: = tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost,= tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -430,24 +520,24 @@ "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clk= s + tma_unknown_branches", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings).", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.1 power-performance optimized state (Fas= ter wakeup time; Smaller power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C01 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c01_wait", - "MetricThreshold": "tma_c01_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c01_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings).", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due staying in C0.2 power-performance optimized state (Slo= wer wakeup time; Larger power savings)", "MetricExpr": "CPU_CLK_UNHALTED.C02 / tma_info_thread_clks", "MetricGroup": "C0Wait;TopdownL4;tma_L4_group;tma_serializing_oper= ation_group", "MetricName": "tma_c02_wait", - "MetricThreshold": "tma_c02_wait > 0.05 & (tma_serializing_operati= on > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_c02_wait > 0.05 & tma_serializing_operatio= n > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -455,7 +545,7 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 
0.1)", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources. Sample with: FRONTEND_RETIRE= D.MS_FLOWS", "ScaleUnit": "100%" }, @@ -464,45 +554,92 @@ "MetricExpr": "(1 - tma_branch_mispredicts / tma_bad_speculation) = * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": 
"100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", - "MetricExpr": "(76.6 * tma_info_system_core_frequency * (MEM_LOAD_= L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMA= ND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD= ))) + 74.6 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_= MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_= info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((81 * tma_info_system_core_frequency - 4.4 * tma_i= nfo_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (79 * tma_info_system_core_fre= quency - 4.4 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XS= NP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / t= ma_info_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. 
Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots wher= e Core non-memory issues were of a bottleneck", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)", - "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_gro= up;tma_backend_bound_group", + "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
-        "MetricExpr": "74.6 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(79 * tma_info_system_core_frequency - 4.4 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
@@ -511,17 +648,17 @@
         "MetricExpr": "ARITH.DIV_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads",
-        "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks - tma_pmm_bound if #has_pmem > 0 else MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks)",
+        "MetricExpr": "MEMORY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -530,7 +667,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -538,75 +675,73 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
-        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
-        "MetricExpr": "81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricExpr": "(170 * tma_info_system_core_frequency * cpu@OCR.DEMAND_RFO.L3_MISS\\,offcore_rsp\\=0x103b800002@ + 81 * tma_info_system_core_frequency * OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
-        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
-        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
+        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
-        "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
-        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
         "MetricExpr": "max(0, tma_heavy_operations - tma_microcode_sequencer)",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
@@ -615,7 +750,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -624,7 +759,15 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric represents fraction of cycles where the Floating-Point Divider unit was active",
+        "MetricExpr": "ARITH.FPDIV_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group",
+        "MetricName": "tma_fp_divider",
+        "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "ScaleUnit": "100%"
     },
     {
@@ -632,8 +775,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIRED2.SCALAR) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -641,8 +784,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.VECTOR + FP_ARITH_INST_RETIRED2.VECTOR) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -650,8 +793,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.128B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -659,8 +802,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.256B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -668,39 +811,37 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE + FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
-        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "BvFB;BvIO;Default;PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "BvFB;BvIO;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+        "MetricgroupNoGroup": "TopdownL1",
+        "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
         "MetricExpr": "tma_light_operations * INST_RETIRED.MACRO_FUSED / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fused_instructions",
         "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
-        "DefaultMetricgroupName": "TopdownL2",
-        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
+        "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots",
+        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL2;Default",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .). Sample with: UOPS_RETIRED.HEAVY",
+        "MetricgroupNoGroup": "TopdownL2",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+]). Sample with: UOPS_RETIRED.HEAVY",
         "ScaleUnit": "100%"
     },
     {
@@ -708,40 +849,40 @@
         "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks",
         "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_icache_misses",
-        "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+        "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-        "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+        "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+        "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 6 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
         "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
         "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-        "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+        "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_mispredicts_resteers"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional non-taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional non-taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for conditional taken branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for conditional taken branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_cond_taken",
         "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200"
     },
     {
-        "BriefDescription": "Instructions per retired mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for indirect CALL or JMP branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_indirect",
-        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3"
+        "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000"
    },
     {
-        "BriefDescription": "Instructions per retired mispredicts for return branches (lower number means higher occurrence rate).",
+        "BriefDescription": "Instructions per retired Mispredicts for return branches (lower number means higher occurrence rate)",
         "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET",
         "MetricGroup": "Bad;BrMispredicts",
         "MetricName": "tma_info_bad_spec_ipmisp_ret",
@@ -755,7 +896,7 @@
         "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200"
     },
     {
-        "BriefDescription": "Speculative to Retired ratio of all clears (covering mispredicts and nukes)",
+        "BriefDescription": "Speculative to Retired ratio of all clears (covering Mispredicts and nukes)",
         "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)",
         "MetricGroup": "BrMispredicts",
         "MetricName": "tma_info_bad_spec_spec_clears_ratio"
@@ -769,15 +910,15 @@
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite)))",
-        "MetricGroup": "DSB;FetchBW;tma_issueFB",
+        "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb + tma_ms)))",
+        "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_bandwidth",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10",
         "PublicDescription": "Total pipeline cost of DSB (uop cache) hits - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp"
     },
     {
         "BriefDescription": "Total pipeline cost of DSB (uop cache) misses - subset of the Instruction_Fetch_BW Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tma_mite))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + tma_dsb + tma_ms))",
         "MetricGroup": "DSBmiss;Fed;tma_issueFB",
         "MetricName": "tma_info_botlnk_l2_dsb_misses",
         "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10",
@@ -785,104 +926,10 @@
     },
     {
         "BriefDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck",
-        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))",
+        "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
         "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL",
         "MetricName": "tma_info_botlnk_l2_ic_misses",
-        "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5",
-        "PublicDescription": "Total pipeline cost of Instruction Cache misses - subset of the Big_Code Bottleneck. Related metrics: "
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
-        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)",
-        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
-        "MetricName": "tma_info_bottleneck_big_code",
-        "MetricThreshold": "tma_info_bottleneck_big_code > 20"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
-        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
-        "MetricGroup": "BvBO;Ret",
-        "MetricName": "tma_info_bottleneck_branching_overhead",
-        "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5",
-        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
-        "MetricName": "tma_info_bottleneck_cache_memory_bandwidth",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
-    },
-    {
-        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
-        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma_store_bound)) * (tma_l1_hit_latency / (tma_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))",
-        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
-        "MetricName": "tma_info_bottleneck_cache_memory_latency",
-        "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20",
-        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
-    },
-    {
-        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
-        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * tma_amx_busy / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tma_ports_utilization / (tma_amx_busy + tma_divider + tma_ports_utilization + tma_serializing_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
-        "MetricGroup": "BvCB;Cor;tma_issueComp",
-        "MetricName": "tma_info_bottleneck_compute_bound_est",
-        "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20",
-        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy. Related metrics: "
Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - (1 - I= NST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.MS\\,cmask\\=3D1@) * (tma_fetc= h_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers += tma_mispredicts_resteers * tma_other_mispredicts / tma_branch_mispredicts)= / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_branches))= / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_m= isses + tma_lcp + tma_ms_switches))) - tma_info_bottleneck_big_code", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * ((1 - INST_RETIRED.REP_ITERATION / cpu@UOPS_R= ETIRED.MS\\,cmask\\=3D1@) * (tma_fetch_latency * (tma_ms_switches + tma_bra= nch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * tma_other_= mispredicts / tma_branch_mispredicts) / (tma_clears_resteers + tma_mispredi= cts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma_dsb_swit= ches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)) + = 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredic= ts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_ot= her_nukes + tma_core_bound * (tma_serializing_operation + cpu@RS.EMPTY\\,um= ask\\=3D1@ / tma_info_thread_clks * tma_ports_utilized_0) / (tma_amx_busy += tma_divider + tma_ports_utilization + tma_serializing_operation) + tma_mic= rocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * = (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_pmm_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_= dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split= _loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma_d= ram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tm= a_store_bound)) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + t= ma_split_stores + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_pmm_bound + tma= _store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) *= tma_remote_cache / (tma_local_mem + tma_remote_cache + tma_remote_mem) + t= ma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound = + tma_pmm_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sha= ring) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + t= ma_sq_full) + tma_store_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bou= nd + tma_l3_bound + tma_pmm_bound + tma_store_bound) * tma_false_sharing / = (tma_dtlb_store + tma_false_sharing + tma_split_stores + tma_store_latency = + tma_streaming_stores - tma_store_latency)) + tma_machine_clears * (1 - tm= a_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -943,11 +990,11 @@ "MetricExpr": "(FP_ARITH_DISPATCHED.PORT_0 + FP_ARITH_DISPATCHED.P= ORT_1 + FP_ARITH_DISPATCHED.PORT_5) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -960,20 +1007,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D1\\,edge@", + "MetricExpr": "ICACHE_DATA.STALLS / cpu@ICACHE_DATA.STALLS\\,cmask= \\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -1002,15 +1049,21 @@ "MetricGroup": "IcMiss", "MetricName": "tma_info_frontend_l2mpki_code_all" }, + { + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, { "BriefDescription": "Average number of cycles the front-end was de= layed due to an Unknown Branch detection", - "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D1\\,edge@", + "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / cpu@INT_MISC.UNKNO= WN_BRANCH_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed", "MetricName": "tma_info_frontend_unknown_branch_cost", - "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node." + "PublicDescription": "Average number of cycles the front-end was d= elayed due to an Unknown Branch detection. See Unknown_Branches node" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -1028,7 +1081,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -1036,7 +1089,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -1044,7 +1097,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -1052,7 +1105,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -1060,7 +1113,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Half-Pr= ecision instruction (lower number means higher occurrence rate)", @@ -1068,7 +1121,7 @@ "MetricGroup": "Flops;FpScalar;InsType;Server", "MetricName": "tma_info_inst_mix_iparith_scalar_hp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_hp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). Values < = 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Half-P= recision instruction (lower number means higher occurrence rate). 
Values < = 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -1076,7 +1129,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -1121,7 +1174,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1131,7 +1184,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 13", + "MetricThreshold": "tma_info_inst_mix_iptb < 6 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1178,7 +1231,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1196,7 +1249,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1238,13 +1291,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1263,12 +1316,12 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": 
"tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, @@ -1321,26 +1374,33 @@ "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" + }, { "BriefDescription": "Average DRAM BW for Reads-to-Core (R2C) cover= ing for memory attached to local- and remote-socket", - "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / duration_time", + "MetricExpr": "64 * OCR.READS_TO_CORE.DRAM / 1e9 / tma_info_system= _time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_dram_bw", - "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW." + "PublicDescription": "Average DRAM BW for Reads-to-Core (R2C) cove= ring for memory attached to local- and remote-socket. See R2C_Offcore_BW" }, { "BriefDescription": "Average L3-cache miss BW for Reads-to-Core (R= 2C)", - "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / duration_tim= e", + "MetricExpr": "64 * OCR.READS_TO_CORE.L3_MISS / 1e9 / tma_info_sys= tem_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_l3m_bw", - "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW." + "PublicDescription": "Average L3-cache miss BW for Reads-to-Core (= R2C). This covering going to DRAM or other memory off-chip memory tears. Se= e R2C_Offcore_BW" }, { "BriefDescription": "Average Off-core access BW for Reads-to-Core = (R2C)", - "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / duratio= n_time", + "MetricExpr": "64 * OCR.READS_TO_CORE.ANY_RESPONSE / 1e9 / tma_inf= o_system_time", "MetricGroup": "HPC;Mem;MemoryBW;SoC", "MetricName": "tma_info_memory_soc_r2c_offcore_bw", - "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches." + "PublicDescription": "Average Off-core access BW for Reads-to-Core= (R2C). 
R2C account for demand or prefetch load/RFO/code access that fill d= ata into the Core caches" }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", @@ -1369,7 +1429,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1390,18 +1450,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Estimated fraction of retirement-cycles deali= ng with repeat instructions", - "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D1@", + "MetricExpr": "INST_RETIRED.REP_ITERATION / cpu@UOPS_RETIRED.SLOTS= \\,cmask\\=3D0x1@", "MetricGroup": "MicroSeq;Pipeline;Ret", "MetricName": "tma_info_pipeline_strings_cycles", "MetricThreshold": "tma_info_pipeline_strings_cycles > 0.1" @@ -1415,7 +1475,7 @@ }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1433,28 +1493,28 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_= INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 = * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS)= + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / d= uration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED2.SCALAR_HALF + 2 * (FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_= INST_RETIRED2.COMPLEX_SCALAR_HALF) + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 = * (FP_ARITH_INST_RETIRED2.128B_PACKED_HALF + FP_ARITH_INST_RETIRED.8_FLOPS)= + 16 * (FP_ARITH_INST_RETIRED2.256B_PACKED_HALF + FP_ARITH_INST_RETIRED.51= 2B_PACKED_SINGLE) + 32 * FP_ARITH_INST_RETIRED2.512B_PACKED_HALF) / 1e9 / t= ma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / durati= on_time", + "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e9 / tma_in= fo_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e9 / duration_time", + "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_write_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. 
Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" @@ -1464,13 +1524,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1481,44 +1542,45 @@ }, { "BriefDescription": "Average latency of data read request to exter= nal DRAM memory [in nanoseconds]", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / uncore_cha_0@event\\=3D0x1@", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_DDR / UNC_= CHA_TOR_INSERTS.IA_MISS_DRD_DDR) / cha_0@event\\=3D0x0@", "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", "MetricName": "tma_info_system_mem_dram_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal DRAM memory [in nanoseconds]. Accounts for demand loads and L1/L2 data= -read prefetches" }, + { + "BriefDescription": "Fraction of Uncore cycles where requests got = rejected due to duplicate address already in IRQ ingress queue in the cache= homing agent", + "MetricExpr": "UNC_CHA_RxC_IRQ1_REJECT.PA_MATCH / UNC_CHA_CLOCKTIC= KS", + "MetricGroup": "LockCont;MemOffcore;Server;SoC", + "MetricName": "tma_info_system_mem_irq_duplicate_address", + "MetricThreshold": "(tma_info_system_mem_irq_duplicate_address > 0= .1)" + }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, - { - "BriefDescription": "Average latency of data read request to exter= nal 3D X-Point memory [in nanoseconds]", - "MetricExpr": "(1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_PMM / UNC= _CHA_TOR_INSERTS.IA_MISS_DRD_PMM) / uncore_cha_0@event\\=3D0x1@ if #has_pme= m > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryLat;Server;SoC", - "MetricName": "tma_info_system_mem_pmm_read_latency", - "PublicDescription": "Average latency of data read request to exte= rnal 3D X-Point memory [in nanoseconds]. 
Accounts for demand loads and L1/L= 2 data-read prefetches" - }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_t= ime)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for reads [= GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_RPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_read_bw" + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" }, { - "BriefDescription": "Average 3DXP Memory Bandwidth Use for Writes = [GB / sec]", - "MetricExpr": "(64 * UNC_M_PMM_WPQ_INSERTS / 1e9 / duration_time i= f #has_pmem > 0 else 0)", - "MetricGroup": "MemOffcore;MemoryBW;Server;SoC", - "MetricName": "tma_info_system_pmm_write_bw" + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1528,10 +1590,17 @@ }, { "BriefDescription": "Socket actual clocks when any core is active = on that socket", - "MetricExpr": "uncore_cha_0@event\\=3D0x1@", + "MetricExpr": "cha_0@event\\=3D0x0@", "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1540,7 +1609,7 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, @@ -1551,7 +1620,7 @@ "MetricName": "tma_info_system_upi_data_transmit_bw" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1560,14 +1629,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": 
"1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1577,13 +1647,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1599,7 +1669,15 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 9" + "MetricThreshold": "tma_info_thread_uptb < 6 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents overall Integer (Int) = select operations fraction the CPU has executed (retired)", @@ -1607,7 +1685,7 @@ "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_int_operations", "MetricThreshold": "tma_int_operations > 0.1 & tma_light_operation= s > 0.6", - "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain.", + "PublicDescription": "This metric represents overall Integer (Int)= select operations fraction the CPU has executed (retired). Vector/Matrix I= nt operations and shuffles are counted. Note this metric's value may exceed= its parent due to use of \"Uops\" CountDomain", "ScaleUnit": "100%" }, { @@ -1615,8 +1693,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_128 + INT_VEC_RETIRED.VNNI_128= ) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_128b", - "MetricThreshold": "tma_int_vector_128b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. 
Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_128b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 128-bit vector Intege= r ADD/SUB/SAD or VNNI (Vector Neural Network Instructions) uops fraction th= e CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_ve= ctor_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_256b, tma= _port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1624,8 +1702,8 @@ "MetricExpr": "(INT_VEC_RETIRED.ADD_256 + INT_VEC_RETIRED.MUL_256 = + INT_VEC_RETIRED.VNNI_256) / (tma_retiring * tma_info_thread_slots)", "MetricGroup": "Compute;IntVector;Pipeline;TopdownL4;tma_L4_group;= tma_int_operations_group;tma_issue2P", "MetricName": "tma_int_vector_256b", - "MetricThreshold": "tma_int_vector_256b > 0.1 & (tma_int_operation= s > 0.1 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_int_vector_256b > 0.1 & tma_int_operations= > 0.1 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents 256-bit vector Intege= r ADD/SUB/SAD/MUL or VNNI (Vector Neural Network Instructions) uops fractio= n the CPU has retired. Related metrics: tma_fp_scalar, tma_fp_vector, tma_f= p_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1633,26 +1711,26 @@ "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((EXE_ACTIVITY.BOUND_ON_LOADS - MEMORY_ACTIVITY.= STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. 
The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - MEMORY_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1660,8 +1738,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L1D_MISS - MEMORY_ACTIVITY.= STALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "4.4 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1669,17 +1756,17 @@ "MetricExpr": "(MEMORY_ACTIVITY.STALLS_L2_MISS - MEMORY_ACTIVITY.S= TALLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "32.6 * tma_info_system_core_frequency * (MEM_LOAD_R= ETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= )) / tma_info_thread_clks", + "MetricExpr": "(37 * tma_info_system_core_frequency - 4.4 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1687,19 +1774,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", - "DefaultMetricgroupName": "TopdownL2", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", - "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_re= tiring_group", + "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1716,7 +1802,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1724,53 +1810,76 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + 
DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "72 * tma_info_system_core_frequency * MEM_LOAD_L3_M= ISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1= _MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(109 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_= LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Machine Clears", - "DefaultMetricgroupName": "TopdownL2", "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts= )", - "MetricGroup": "BadSpec;BvMS;Default;MachineClears;TmaL2;TopdownL2= ;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", + "MetricGroup": "BadSpec;BvMS;MachineClears;TmaL2;TopdownL2;tma_L2_= group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn", "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling).", + "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to memory bandwidth Allocation= feature (RDT's memory bandwidth throttling)", "MetricExpr": "INT_MISC.MBA_STALLS / tma_info_thread_clks", "MetricGroup": "MemoryBW;Offcore;Server;TopdownL5;tma_L5_group;tma= _mem_bandwidth_group", "MetricName": "tma_mba_stalls", - "MetricThreshold": "tma_mba_stalls > 0.1 & (tma_mem_bandwidth > 0.= 2 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0= .2)))", + "MetricThreshold": "tma_mba_stalls > 0.1 & tma_mem_bandwidth > 0.2= & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1778,32 +1887,31 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
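For readers following the tma_mem_bandwidth/tma_mem_latency pair: bandwidth-bound cycles are those with at least four demand data reads in flight off-core (the cmask=0x4 qualifier), and latency-bound cycles are the remaining ones with any read in flight. A minimal sketch of that split, with invented counter values:

    clks          = 1_000_000   # tma_info_thread_clks
    cycles_any_rd = 400_000     # OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DATA_RD
    cycles_ge4_rd = 150_000     # OFFCORE_REQUESTS_OUTSTANDING.ALL_DATA_RD, cmask=0x4
    tma_mem_bandwidth = min(clks, cycles_ge4_rd) / clks
    tma_mem_latency   = min(clks, cycles_any_rd) / clks - tma_mem_bandwidth
    print(f"bandwidth {tma_mem_bandwidth:.0%}, latency {tma_mem_latency:.0%}")   # 15%, 25%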
Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of slots the = Memory subsystem within the Backend was a bottleneck", - "DefaultMetricgroupName": "TopdownL2", - "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_in= fo_thread_slots", - "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_b= ackend_bound_group", + "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + top= down\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_b= ound_group", "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", - "MetricgroupNoGroup": "TopdownL2;Default", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "MetricgroupNoGroup": "TopdownL2", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to LFENCE Instructions", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "13 * MISC2_RETIRED.LFENCE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_memory_fence", - "MetricThreshold": "tma_memory_fence > 0.05 & (tma_serializing_ope= ration > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_memory_fence > 0.05 & tma_serializing_oper= ation > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricExpr": "tma_light_operations * MEM_UOP_RETIRED.ANY / (tma_r= etiring * tma_info_thread_slots)", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1816,7 +1924,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations 
> 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_clears_reste= ers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clea= rs, tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: UOPS_RETIRED.MS. Related metrics: tma_bottleneck_i= rregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, t= ma_ms_switches", "ScaleUnit": "100%" }, { @@ -1824,8 +1932,8 @@ "MetricExpr": "tma_branch_mispredicts / tma_bad_speculation * INT_= MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1838,21 +1946,29 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "160 * ASSISTS.SSE_AVX_MIX / tma_info_thread_clks", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). 
Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "max(IDQ.MS_CYCLES_ANY, cpu@UOPS_RETIRED.MS\\,cmask\= \=3D0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY)) / tma_info_core_core_clk= s / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates the fraction of cycles = when the CPU was stalled due to switches of uop delivery to the Microcode S= equencer (MS)", - "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D1\\,edge@ / (UO= PS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", + "MetricExpr": "3 * cpu@UOPS_RETIRED.MS\\,cmask\\=3D0x1\\,edge\\=3D= 0x1@ / (UOPS_RETIRED.SLOTS / UOPS_ISSUED.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. Related metrics: tm= a_clears_resteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tm= a_machine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializ= ing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: FRONTEND_RETIRED.MS_FLOWS. 
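The tma_ms_switches expression charges a flat ~3-cycle penalty per entry into the MS; the cmask=1,edge=1 qualifier counts rising edges of UOPS_RETIRED.MS (approximately the number of switches), scaled by the retired-to-issued uop ratio. Illustrative numbers only:

    ms_entries    = 20_000      # UOPS_RETIRED.MS, cmask=0x1,edge=0x1 (~ number of MS entries)
    retired_slots = 4_000_000   # UOPS_RETIRED.SLOTS
    issued_uops   = 4_400_000   # UOPS_ISSUED.ANY
    clks          = 2_000_000   # tma_info_thread_clks
    tma_ms_switches = 3 * ms_entries / (retired_slots / issued_uops) / clks
    print(f"{tma_ms_switches:.2%}")   # 3.30%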
Related metrics: tm= a_bottleneck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_mac= hine_clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_o= peration", "ScaleUnit": "100%" }, { @@ -1861,7 +1977,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%" }, { @@ -1869,7 +1985,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1883,19 +1999,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1904,16 +2020,7 @@ "MetricGroup": "TopdownL5;tma_L5_group;tma_assists_group", "MetricName": 
"tma_page_faults", "MetricThreshold": "tma_page_faults > 0.05", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost.", - "ScaleUnit": "100%" - }, - { - "BriefDescription": "This metric roughly estimates (based on idle = latencies) how often the CPU was stalled on accesses to external 3D-Xpoint = (Crystal Ridge, a.k.a", - "MetricExpr": "(((1 - (19 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM = * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 10 * (MEM_LOA= D_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))) / (19 * (MEM_LO= AD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS)) + 10 * (MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD= _RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOAD_L3_MISS_RETIRED.REMO= TE_FWD * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + MEM_LOA= D_L3_MISS_RETIRED.REMOTE_HITM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RET= IRED.L1_MISS)) + (25 * (MEM_LOAD_RETIRED.LOCAL_PMM * (1 + MEM_LOAD_RETIRED.= FB_HIT / MEM_LOAD_RETIRED.L1_MISS)) + 33 * (MEM_LOAD_L3_MISS_RETIRED.REMOTE= _PMM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS))))) * (MEMO= RY_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks) if 1e6 * (MEM_LOAD_L3_MI= SS_RETIRED.REMOTE_PMM + MEM_LOAD_RETIRED.LOCAL_PMM) > MEM_LOAD_RETIRED.L1_M= ISS else 0) if #has_pmem > 0 else 0)", - "MetricGroup": "MemoryBound;Server;TmaL3mem;TopdownL3;tma_L3_group= ;tma_memory_bound_group", - "MetricName": "tma_pmm_bound", - "MetricThreshold": "tma_pmm_bound > 0.1 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric roughly estimates (based on idle= latencies) how often the CPU was stalled on accesses to external 3D-Xpoint= (Crystal Ridge, a.k.a. IXP) memory by loads, PMM stands for Persistent Mem= ory Module.", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Page Faults. A Page Fault m= ay apply on first application access to a memory page. Note operating syste= m handling of page faults accounts for the majority of its cost", "ScaleUnit": "100%" }, { @@ -1922,7 +2029,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_5, tma_po= rt_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED.PORT_0. 
Related metrics: tma_fp_scala= r, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512= b, tma_int_vector_128b, tma_int_vector_256b, tma_port_1, tma_port_6, tma_po= rts_utilized_2", "ScaleUnit": "100%" }, { @@ -1931,7 +2038,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_12= 8b, tma_fp_vector_256b, tma_fp_vector_512b, tma_int_vector_128b, tma_int_ve= ctor_256b, tma_port_0, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1940,25 +2047,25 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= t_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_por= ts_utilized_2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU performance was potentially limited due to Core computation issues (non= divider-related)", - "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,um= ask\\=3D0xc@)) / tma_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.= STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL = + tma_retiring * cpu@EXE_ACTIVITY.2_PORTS_UTIL\\,umask\\=3D0xc@) / tma_info= _thread_clks)", + "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_3_PORTS_UTIL)) / tm= a_info_thread_clks if ARITH.DIV_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - EXE_= ACTIVITY.BOUND_ON_LOADS else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring * EX= E_ACTIVITY.2_3_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(cpu@RS.EMPTY\= \,umask\\=3D1@ - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (= CYCLE_ACTIVITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_threa= d_clks", + "MetricExpr": "(EXE_ACTIVITY.EXE_BOUND_0_PORTS + max(RS.EMPTY_RESO= URCE - RESOURCE_STALLS.SCOREBOARD, 0)) / tma_info_thread_clks * (CYCLE_ACTI= VITY.STALLS_TOTAL - EXE_ACTIVITY.BOUND_ON_LOADS) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1966,7 +2073,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. 
In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1976,8 +2083,8 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= int_vector_128b, tma_int_vector_256b, tma_port_0, tma_port_1, tma_port_6", "ScaleUnit": "100%" }, { @@ -1986,36 +2093,35 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). 
Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", - "MetricExpr": "(133 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.REMOTE_HITM + 133 * tma_info_system_core_frequency * MEM_LOAD= _L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETI= RED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "((170 * tma_info_system_core_frequency - 37 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (170 * = tma_info_system_core_frequency - 37 * tma_info_system_core_frequency) * MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD= _RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "153 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(190 * tma_info_system_core_frequency - 37 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. 
Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", - "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", - "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", + "MetricGroup": "BvUW;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", - "MetricgroupNoGroup": "TopdownL1;Default", + "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "This category represents fraction of slots u= tilized by useful work i.e. issued uops that eventually get retired. Ideall= y; all pipeline slots would be attributed to the Retiring category. Retiri= ng of 100% would indicate the maximum Pipeline_Width throughput was achieve= d. Maximizing Retiring typically increases the Instructions-per-cycle (see= IPC metric). Note that a high Retiring value does not necessary mean there= is no room for more performance. For example; Heavy-operations or Microco= de Assists are categorized under Retiring. They often indicate suboptimal p= erformance and can often be optimized or avoided. Sample with: UOPS_RETIRED= .SLOTS", "ScaleUnit": "100%" }, @@ -2024,7 +2130,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks += tma_c02_wait", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -2033,8 +2139,8 @@ "MetricExpr": "tma_light_operations * INT_VEC_RETIRED.SHUFFLES / (= tma_retiring * tma_info_thread_slots)", "MetricGroup": "HPC;Pipeline;TopdownL4;tma_L4_group;tma_other_ligh= t_ops_group", "MetricName": "tma_shuffles_256b", - "MetricThreshold": "tma_shuffles_256b > 0.1 & (tma_other_light_ops= > 0.3 & tma_light_operations > 0.6)", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). 
Shuffles may incur slow cross \"vector lane\" data transfers.", + "MetricThreshold": "tma_shuffles_256b > 0.1 & tma_other_light_ops = > 0.3 & tma_light_operations > 0.6", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring Shuffle operations of 256-bit vector size (FP or In= teger). Shuffles may incur slow cross \"vector lane\" data transfers", "ScaleUnit": "100%" }, { @@ -2043,7 +2149,7 @@ "MetricExpr": "CPU_CLK_UNHALTED.PAUSE / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: CPU_CLK_UNHALTED.= PAUSE_INST", "ScaleUnit": "100%" }, @@ -2052,8 +2158,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -2061,17 +2167,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . 
Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(XQ.FULL_CYCLES + L1D_PEND_MISS.L2_STALLS) / tma_in= fo_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -2079,8 +2185,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -2088,17 +2194,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(MEM_STORE_RETIRED.L2_HIT * 10 * (1 - MEM_INST_RETI= RED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE= _REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -2115,7 +2221,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -2123,7 +2229,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -2131,7 +2261,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - 
"MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -2140,7 +2270,7 @@ "MetricExpr": "INT_MISC.UNKNOWN_BRANCH_CYCLES / tma_info_thread_cl= ks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: FRONTEND_RETIRED.UNKNOWN_BRANCH", "ScaleUnit": "100%" }, @@ -2149,8 +2279,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. 
See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json b= /tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json index 91013ced74aa..aab082ff9402 100644 --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/uncore-io.json @@ -201,7 +201,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -214,7 +214,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 1", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -227,7 +227,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 2", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -240,7 +240,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 3", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -253,7 +253,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 0 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 4", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -266,7 +266,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 1 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 5", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -279,7 +279,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 6", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -292,7 +292,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "PCIe Completion Buffer Inserts of completion= s with data : Part 2 : x16 card plugged in to Lane 0/1/2/3, Or x8 card plug= ged in to Lane 0/1, Or x4 card is plugged in to slot 7", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -316,7 +316,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x16 card plugged in to stack, Or x8 card plu= gged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7000001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -329,7 +329,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 1", - "UMask": "0x7000002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -342,7 +342,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card i= s plugged in to slot 1", - "UMask": "0x7000004", + "UMask": "0x4", "Unit": "IIO" }, { 
@@ -355,7 +355,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 3", - "UMask": "0x7000008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -368,7 +368,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x16 card plugged in to stack, Or x8 card plu= gged in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7000010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -381,7 +381,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 1", - "UMask": "0x7000020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -394,7 +394,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x8 card plugged in to Lane 2/3, Or x4 card i= s plugged in to slot 1", - "UMask": "0x7000040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -407,7 +407,7 @@ "PerPkg": "1", "PortMask": "0x0000", "PublicDescription": "x4 card is plugged in to slot 3", - "UMask": "0x7000080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -431,7 +431,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 c= ard is plugged in to slot 0", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -443,7 +443,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x4 card is plugged in to slot 1", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -455,7 +455,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -467,7 +467,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Core reading fro= m Card's MMIO space : Number of DWs (4 bytes) requested by the main die. I= ncludes all requests initiated by the main die, including reads and writes.= : x4 card is plugged in to slot 3", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -479,7 +479,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x16 card plugged in to stack, Or x8 card plugged in to Lane 0/1, Or x4 ca= rd is plugged in to slot 0", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -491,7 +491,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. 
= : x4 card is plugged in to slot 1", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -503,7 +503,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 1", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -515,7 +515,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Core reading fro= m Cards MMIO space : Number of DWs (4 bytes) requested by the main die. In= cludes all requests initiated by the main die, including reads and writes. = : x4 card is plugged in to slot 3", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -662,7 +662,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -675,7 +675,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -688,7 +688,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -701,7 +701,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -714,7 +714,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x16 card plugged in to stack, Or x8 card plugged = in to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -727,7 +727,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. 
: x4 card is plugged in to slot 1", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -740,7 +740,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plu= gged in to slot 1", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -753,7 +753,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) reading from this card. : Number of DWs (4 bytes) reques= ted by the main die. Includes all requests initiated by the main die, incl= uding reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -766,7 +766,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x16 card plugged in to stack, Or x8 card plugged in= to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -779,7 +779,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -792,7 +792,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugg= ed in to slot 1", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +805,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -818,7 +818,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x16 card plugged in to stack, Or x8 card plugged in= to Lane 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -831,7 +831,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. 
: x4 card is plugged in to slot 1", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -844,7 +844,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x8 card plugged in to Lane 2/3, Or x4 card is plugg= ed in to slot 1", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -857,7 +857,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Data requested by the CPU : Another card (di= fferent IIO stack) writing to this card. : Number of DWs (4 bytes) requeste= d by the main die. Includes all requests initiated by the main die, includ= ing reads and writes. : x4 card is plugged in to slot 3", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -974,7 +974,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff", @@ -1082,7 +1081,6 @@ "Counter": "0,1", "EventCode": "0x83", "EventName": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS", - "Experimental": "1", "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00ff", @@ -1299,7 +1297,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Passing data= to be written : How often different queues (e.g. channel / fc) ask to send= request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1351,7 +1349,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Request Owne= rship : How often different queues (e.g. channel / fc) ask to send request = into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1364,7 +1362,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests : Writing line= : How often different queues (e.g. channel / fc) ask to send request into = pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1377,7 +1375,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Pass= ing data to be written : How often different queues (e.g. channel / fc) are= allowed to send request into pipeline : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1390,7 +1388,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issu= ing final read or write of line : How often different queues (e.g. channel = / fc) are allowed to send request into pipeline", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1403,7 +1401,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Proc= essing response from IOMMU : How often different queues (e.g. channel / fc)= are allowed to send request into pipeline", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1416,7 +1414,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Issu= ing to IOMMU : How often different queues (e.g. 
channel / fc) are allowed t= o send request into pipeline", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1429,7 +1427,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Requ= est Ownership : How often different queues (e.g. channel / fc) are allowed = to send request into pipeline : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1442,7 +1440,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Incoming arbitration requests granted : Writ= ing line : How often different queues (e.g. channel / fc) are allowed to se= nd request into pipeline : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1962,7 +1960,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : PCIe Request complete : = Only for posted requests : Each PCIe request is broken down into a series o= f cacheline granular requests and each cacheline size request may need to m= ake multiple passes through the pipeline (e.g. for posted interrupts or mul= ti-cast). Each time a single PCIe request completes all its cacheline gra= nular requests, it advances pointer.", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1975,7 +1973,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Writing line : Only for = posted requests : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1988,7 +1986,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Issuing final read or wr= ite of line : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2001,7 +1999,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Request Ownership : Passing data to be writt= en : Only for posted requests : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -2014,7 +2012,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Passing dat= a to be written : Only for posted requests", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -2026,7 +2024,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x00FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2039,7 +2037,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Request Own= ership : Only for posted requests", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -2052,7 +2050,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "Processing response from IOMMU : Writing lin= e : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2065,7 +2063,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Passing data = to be written : Each PCIe request is broken down into a series of cacheline= granular requests and each cacheline size request may need to make multipl= e passes through the pipeline (e.g. for posted interrupts or multi-cast). = Each time a cacheline completes a single pass (e.g. 
posts a write to singl= e multi-cast target) it advances state : Only for posted requests", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -2091,7 +2089,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Request Owner= ship : Each PCIe request is broken down into a series of cacheline granular= requests and each cacheline size request may need to make multiple passes = through the pipeline (e.g. for posted interrupts or multi-cast). Each tim= e a cacheline completes a single pass (e.g. posts a write to single multi-c= ast target) it advances state : Only for posted requests", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -2104,7 +2102,7 @@ "PerPkg": "1", "PortMask": "0x00FF", "PublicDescription": "PCIe Request - pass complete : Writing line = : Each PCIe request is broken down into a series of cacheline granular requ= ests and each cacheline size request may need to make multiple passes throu= gh the pipeline (e.g. for posted interrupts or multi-cast). Each time a c= acheline completes a single pass (e.g. posts a write to single multi-cast t= arget) it advances state : Only for posted requests", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -2309,7 +2307,7 @@ "PerPkg": "1", "PortMask": "0x0001", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x16 card plugged in to Lane 0/1/2/3, Or x8 card plugged in to Lane= 0/1, Or x4 card is plugged in to slot 0", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2322,7 +2320,7 @@ "PerPkg": "1", "PortMask": "0x0002", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 1", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2335,7 +2333,7 @@ "PerPkg": "1", "PortMask": "0x0004", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x8 card plugged in to Lane 2/3, Or x4 card is plugged in to slot 2= ", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2348,7 +2346,7 @@ "PerPkg": "1", "PortMask": "0x0008", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 3", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2361,7 +2359,7 @@ "PerPkg": "1", "PortMask": "0x0010", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x16 card plugged in to Lane 4/5/6/7, Or x8 card plugged in to Lane= 4/5, Or x4 card is plugged in to slot 4", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2374,7 +2372,7 @@ "PerPkg": "1", "PortMask": "0x0020", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. 
: Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 5", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2387,7 +2385,7 @@ "PerPkg": "1", "PortMask": "0x0040", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x8 card plugged in to Lane 6/7, Or x4 card is plugged in to slot 6= ", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -2400,7 +2398,7 @@ "PerPkg": "1", "PortMask": "0x0080", "PublicDescription": "Number Transactions requested by the CPU : A= nother card (different IIO stack) writing to this card. : Also known as Out= bound. Number of requests initiated by the main die, including reads and w= rites. : x4 card is plugged in to slot 7", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { --=20 2.48.0.rc2.279.g1de40edade-goog
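For a quick sanity check of the corrected encodings in the Sapphire Rapids uncore-io hunks above, UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 can be counted either by its raw encoding or by name. This is a minimal sketch rather than part of the patch: event code 0x83, the fixed umask 0x4, FCMask 0x07 and PortMask 0x0001 are taken from the JSON above, while the uncore_iio_0 PMU instance and its ch_mask/fc_mask format attributes are assumed to be exposed by the kernel on the target system.

  # Raw encoding: core reads of a card's MMIO space on IIO stack 0, port 0
  # (PART0). umask=0x4 is the corrected value; the old 0x7001004 had the
  # fc mask and port mask bits folded into the UMask field.
  perf stat -a -e 'uncore_iio_0/event=0x83,umask=0x4,ch_mask=0x1,fc_mask=0x7/' -- sleep 1

  # The same count via the named event, once the updated JSON is installed:
  perf stat -a -e 'UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0' -- sleep 1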
From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:52 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-21-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 20/23] perf vendor events: Update Sierraforest events/metrics
From: Ian Rogers
To: Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , Namhyung Kim , Mark Rutland , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Kan Liang , Andreas Färber , Manivannan Sadhasivam , Weilin Wang , linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor , Samantha Alt , Caleb Biggers , Edward Baker , Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.04 to v1.07. Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.07:
https://github.com/intel/perfmon/commit/903b3d0a0a61bb6064013db9eb4c26457dacfea6
https://github.com/intel/perfmon/commit/825c4361473e676119b51f04c7896a8cfa8a5ea5
https://github.com/intel/perfmon/commit/bafe6a7b5cbee92c31ec19dfcefd6dcc243e4e8a

The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Update the uncore IIO event umasks with the change:
https://github.com/intel/perfmon/commit/d78e8a166537c9ceab4f2e901dc96c53667a2174
which should address an issue originally raised by Michael Petlan:

Reported-by: Michael Petlan
Closes: https://lore.kernel.org/all/alpine.LRH.2.20.2401300733310.11354@Diego/
Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
 .../arch/x86/sierraforest/counter.json | 2 +-
 .../arch/x86/sierraforest/other.json | 20 ++
 .../arch/x86/sierraforest/pipeline.json | 3 +-
 .../arch/x86/sierraforest/srf-metrics.json | 308 ++++++------
 .../arch/x86/sierraforest/uncore-cache.json | 28 +-
 .../x86/sierraforest/uncore-interconnect.json | 87 +++++
 .../arch/x86/sierraforest/uncore-io.json | 280 ++++++++--------
 .../arch/x86/sierraforest/uncore-memory.json | 122 ++++++-
 .../arch/x86/sierraforest/uncore-power.json | 98 ++++++
 10 files changed, 581 insertions(+), 369 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index 9e37d38036a4..344ca1873553 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -29,7 +29,7 @@ GenuineIntel-6-2E,v4,nehalemex,core GenuineIntel-6-A7,v1.04,rocketlake,core GenuineIntel-6-2A,v19,sandybridge,core GenuineIntel-6-8F,v1.25,sapphirerapids,core -GenuineIntel-6-AF,v1.04,sierraforest,core +GenuineIntel-6-AF,v1.07,sierraforest,core GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core GenuineIntel-6-55-[01234],v1.35,skylakex,core diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/counter.json b/too= ls/perf/pmu-events/arch/x86/sierraforest/counter.json index e57e3bf98b2a..c62a147e360b 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/counter.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/counter.json @@ -52,7 +52,7 @@ { "Unit": "PCU", "CountersNumFixed": "0", - "CountersNumGeneric": 4 + "CountersNumGeneric": "4" }, { "Unit": "CHACMS", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/other.json b/tools= /perf/pmu-events/arch/x86/sierraforest/other.json index 28f9a4c3ea84..4c77dac8ec78 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/other.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/other.json @@ -18,6 +18,26 @@ "SampleAfterValue": "100003", "UMask": "0x1" }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to this socket.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.LOCAL_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x184000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, + { + "BriefDescription": "Counts demand data reads that were supplied b= y DRAM attached to another socket.", + "Counter": "0,1,2,3,4,5,6,7", + "EventCode": "0xB7", + "EventName": "OCR.DEMAND_DATA_RD.REMOTE_DRAM", + "MSRIndex": "0x1a6,0x1a7", + "MSRValue": "0x730000001", + "SampleAfterValue": "100003", + "UMask": "0x1" + }, { "BriefDescription":
"Counts demand reads for ownership (RFO) and s= oftware prefetches for exclusive ownership (PREFETCHW) that have any type o= f response.", "Counter": "0,1,2,3,4,5,6,7", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json b/to= ols/perf/pmu-events/arch/x86/sierraforest/pipeline.json index b67c0c89054d..40fa4f5ae261 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/pipeline.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "Counts the number of cycles when any of the d= ividers are active.", + "BriefDescription": "Counts the number of cycles when any of the f= loating point or integer dividers are active.", "Counter": "0,1,2,3,4,5,6,7", "CounterMask": "1", "EventCode": "0xcd", @@ -185,7 +185,6 @@ "BriefDescription": "Fixed Counter: Counts the number of instructi= ons retired", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PEBS": "1", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json b= /tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json index b881b1958f11..83c86afd2960 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/srf-metrics.json @@ -1,4 +1,11 @@ [ + { + "BriefDescription": "C10 residency percent per package", + "MetricExpr": "cstate_pkg@c10\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C10_Pkg_Residency", + "ScaleUnit": "100%" + }, { "BriefDescription": "C1 residency percent per core", "MetricExpr": "cstate_core@c1\\-residency@ / TSC", @@ -7,17 +14,24 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per core", - "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "BriefDescription": "C2 residency percent per package", + "MetricExpr": "cstate_pkg@c2\\-residency@ / TSC", "MetricGroup": "Power", - "MetricName": "C6_Core_Residency", + "MetricName": "C2_Pkg_Residency", "ScaleUnit": "100%" }, { - "BriefDescription": "C6 residency percent per module", - "MetricExpr": "cstate_module@c6\\-residency@ / TSC", + "BriefDescription": "C3 residency percent per package", + "MetricExpr": "cstate_pkg@c3\\-residency@ / TSC", "MetricGroup": "Power", - "MetricName": "C6_Module_Residency", + "MetricName": "C3_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C6 residency percent per core", + "MetricExpr": "cstate_core@c6\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C6_Core_Residency", "ScaleUnit": "100%" }, { @@ -28,7 +42,21 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "C7 residency percent per core", + "MetricExpr": "cstate_core@c7\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C7_Core_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "C8 residency percent per package", + "MetricExpr": "cstate_pkg@c8\\-residency@ / TSC", + "MetricGroup": "Power", + "MetricName": "C8_Pkg_Residency", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -49,67 +77,67 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total 
number of c= ompleted instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", "MetricName": "dtlb_2nd_level_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_2nd_level_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_2nd_level_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. 
This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic contoller (IIO) of IO reads that are initiated by end device controller= s that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic contoller (IIO) of IO reads that are initiated by end device controller= s that are requesting memory from the CPU", "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.ALL_PARTS * 4 / 1e= 6 / duration_time", "MetricName": "iio_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO writes that are initiated by end device controll= ers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth observed by the integrated I/O traf= fic controller (IIO) of IO writes that are initiated by end device controll= ers that are writing memory to the CPU", "MetricExpr": "UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.ALL_PARTS * 4 / 1= e6 / duration_time", "MetricName": "iio_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR * 64 / 1e6 / durati= on_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket.= ", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the local CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_LOCAL * 64 / 1e6 / = duration_time", "MetricName": "io_bandwidth_read_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from a remote CPU socket", "MetricExpr": "UNC_CHA_TOR_INSERTS.IO_PCIRDCUR_REMOTE * 64 / 1e6 /= duration_time", "MetricName": "io_bandwidth_read_remote", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM + UNC_CHA_TOR_INSERTS.= IO_ITOMCACHENEAR) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the local CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_LOCAL + UNC_CHA_TOR_IN= SERTS.IO_ITOMCACHENEAR_LOCAL) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_local", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket.", + "BriefDescription": "Bandwidth 
of IO writes that are initiated by = end device controllers that are writing memory to a remote CPU socket", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IO_ITOM_REMOTE + UNC_CHA_TOR_I= NSERTS.IO_ITOMCACHENEAR_REMOTE) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write_remote", "ScaleUnit": "1MB/s" @@ -118,14 +146,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_2nd_level_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_2nd_level_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -177,25 +205,25 @@ "ScaleUnit": "1ns" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", "MetricName": "llc_miss_local_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_local_memory_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_remote_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", "MetricName": "llc_miss_remote_memory_bandwidth_write", "ScaleUnit": "1MB/s" @@ -225,13 +253,13 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_LOCAL + UNC_CH= A_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL) / (UNC_CHA_TOR_INSERTS.IA_MISS_DR= D_OPT_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL + UNC_CHA_TOR_= INSERTS.IA_MISS_DRD_OPT_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_R= EMOTE)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "(UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_REMOTE + UNC_C= HA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_REMOTE) / (UNC_CHA_TOR_INSERTS.IA_MISS_= DRD_OPT_LOCAL + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL + UNC_CHA_TO= R_INSERTS.IA_MISS_DRD_OPT_REMOTE + UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF= _REMOTE)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -260,17 +288,15 @@ { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to certain allocation restrictions", "MetricExpr": "tma_core_bound", - "MetricGroup": 
"TopdownL3;tma_L3_group;tma_core_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_allocation_restriction", - "MetricThreshold": "tma_allocation_restriction > 0.1 & (tma_core_b= ound > 0.1 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend due to backend stalls", "MetricExpr": "TOPDOWN_BE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_backend_bound", - "MetricThreshold": "tma_backend_bound > 0.1", "MetricgroupNoGroup": "TopdownL1", "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend due to backend stalls. Note that uops must= be available for consumption in order for this event to count. If a uop is= not available (IQ is empty), this event will not count", "ScaleUnit": "100%" @@ -278,104 +304,92 @@ { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a misp= redicted jump or a machine clear", "MetricExpr": "TOPDOWN_BAD_SPECULATION.ALL_P / (6 * CPU_CLK_UNHALT= ED.CORE)", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_bad_speculation", - "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear.", + "PublicDescription": "Counts the total number of issue slots that = were not consumed by the backend because allocation is stalled due to a mis= predicted jump or a machine clear. Only issue slots wasted due to fast nuke= s such as memory ordering nukes are counted. Other nukes are not accounted = for. Counts all issue slots blocked during this recovery window including r= elevant microcode flows and while uops are not yet available in the instruc= tion queue (IQ). 
Also includes the issue slots that were consumed by the ba= ckend but were thrown away because they were younger than the mispredict or= machine clear", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BACLEARS, which occurs when the Branch T= arget Buffer (BTB) prediction or lack thereof, was corrected by a later bra= nch predictor in the frontend", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_DETECT / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_branch_detect", - "MetricThreshold": "tma_branch_detect > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", - "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s.", + "PublicDescription": "Counts the number of issue slots that were n= ot delivered by the frontend due to BACLEARS, which occurs when the Branch = Target Buffer (BTB) prediction or lack thereof, was corrected by a later br= anch predictor in the frontend. Includes BACLEARS due to all branch types i= ncluding conditional and unconditional jumps, returns, and indirect branche= s", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to branch mispredicts", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MISPREDICT / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", "MetricName": "tma_branch_mispredicts", - "MetricThreshold": "tma_branch_mispredicts > 0.05 & tma_bad_specul= ation > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to BTCLEARS, which occurs when the Branch T= arget Buffer (BTB) predicts a taken branch", "MetricExpr": "TOPDOWN_FE_BOUND.BRANCH_RESTEER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_branch_resteer", - "MetricThreshold": "tma_branch_resteer > 0.05 & (tma_ifetch_latenc= y > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS).", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to the microcode sequencer (MS)", "MetricExpr": "TOPDOWN_FE_BOUND.CISC / (6 * CPU_CLK_UNHALTED.CORE)= ", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.05 & (tma_ifetch_bandwidth > 0.1 = & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { 
"BriefDescription": "Counts the number of cycles due to backend bo= und stalls that are bounded by core restrictions and not attributed to an o= utstanding load or stores, or resource limitation", "MetricExpr": "TOPDOWN_BE_BOUND.ALLOC_RESTRICTIONS / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", "MetricName": "tma_core_bound", - "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.1= ", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to decode stalls", "MetricExpr": "TOPDOWN_FE_BOUND.DECODE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_decode", - "MetricThreshold": "tma_decode > 0.05 & (tma_ifetch_bandwidth > 0.= 1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that does not require the = use of microcode, classified as a fast nuke, due to memory ordering, memory= disambiguation and memory renaming", "MetricExpr": "TOPDOWN_BAD_SPECULATION.FASTNUKE / (6 * CPU_CLK_UNH= ALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", "MetricName": "tma_fast_nuke", - "MetricThreshold": "tma_fast_nuke > 0.05 & (tma_machine_clears > 0= .05 & tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls.", + "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to frontend stalls", "MetricExpr": "TOPDOWN_FE_BOUND.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_frontend_bound", - "MetricThreshold": "tma_frontend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to instruction cache misses", "MetricExpr": "TOPDOWN_FE_BOUND.ICACHE / (6 * CPU_CLK_UNHALTED.COR= E)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_ifetch_latency= > 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend bandwidth restrictions due to d= ecode, predecode, cisc, and other limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_BANDWIDTH / (6 * CPU_CLK_= UNHALTED.CORE)", - "MetricGroup": 
"TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", "MetricName": "tma_ifetch_bandwidth", - "MetricThreshold": "tma_ifetch_bandwidth > 0.1 & tma_frontend_boun= d > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to frontend latency restrictions due to ica= che misses, itlb misses, branch detection, and resteer limitations", "MetricExpr": "TOPDOWN_FE_BOUND.FRONTEND_LATENCY / (6 * CPU_CLK_UN= HALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_frontend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_frontend_bound_gr= oup", "MetricName": "tma_ifetch_latency", - "MetricThreshold": "tma_ifetch_latency > 0.15 & tma_frontend_bound= > 0.2", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, @@ -403,32 +417,6 @@ "MetricGroup": "Flops", "MetricName": "tma_info_arith_inst_mix_ipfparith_scalar_sp" }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", - "MetricExpr": "tma_info_bottleneck_dtlb_miss_bound_cycles", - "MetricName": "tma_info_bottleneck_%_dtlb_miss_bound_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation and retire= ment is stalled by the Frontend Cluster due to an Ifetch Miss, either Icach= e or ITLB Miss", - "MetricExpr": "tma_info_bottleneck_ifetch_miss_bound_cycles", - "MetricGroup": "Ifetch", - "MetricName": "tma_info_bottleneck_%_ifetch_miss_bound_cycles", - "PublicDescription": "Percentage of time that allocation and retir= ement is stalled by the Frontend Cluster due to an Ifetch Miss, either Icac= he or ITLB Miss. See Info.Ifetch_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= due to an L1 miss", - "MetricExpr": "tma_info_bottleneck_load_miss_bound_cycles", - "MetricGroup": "Load_Store_Miss", - "MetricName": "tma_info_bottleneck_%_load_miss_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d due to an L1 miss. See Info.Load_Miss_Bound" - }, - { - "BriefDescription": "Percentage of time that retirement is stalled= by the Memory Cluster due to a pipeline stall", - "MetricExpr": "tma_info_bottleneck_mem_exec_bound_cycles", - "MetricGroup": "Mem_Exec", - "MetricName": "tma_info_bottleneck_%_mem_exec_bound_cycles", - "PublicDescription": "Percentage of time that retirement is stalle= d by the Memory Cluster due to a pipeline stall. 
See Info.Mem_Exec_Bound" - }, { "BriefDescription": "Percentage of time that retirement is stalled= due to a first level data TLB miss", "MetricExpr": "100 * (LD_HEAD.DTLB_MISS_AT_RET + LD_HEAD.PGWALK_AT= _RET) / CPU_CLK_UNHALTED.CORE", @@ -510,21 +498,6 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / BACLEARS.ANY", "MetricName": "tma_info_br_mispredict_bound_branch_mispredict_to_u= nknown_branch_ratio" }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", - "MetricExpr": "tma_info_buffer_stalls_load_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_load_buffer_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to memory reservation stations full", - "MetricExpr": "tma_info_buffer_stalls_mem_rsv_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_mem_rsv_stall_cycles" - }, - { - "BriefDescription": "Percentage of time that allocation is stalled= due to store buffer full", - "MetricExpr": "tma_info_buffer_stalls_store_buffer_stall_cycles", - "MetricName": "tma_info_buffer_stalls_%_store_buffer_stall_cycles" - }, { "BriefDescription": "Percentage of time that allocation is stalled= due to load buffer full", "MetricExpr": "100 * MEM_SCHEDULER_BLOCK.LD_BUF / CPU_CLK_UNHALTED= .CORE", @@ -546,7 +519,8 @@ { "BriefDescription": "Cycles Per Instruction", "MetricExpr": "CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY", - "MetricName": "tma_info_core_cpi" + "MetricName": "tma_info_core_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Floating Point Operations Per Cycle", @@ -564,21 +538,6 @@ "MetricExpr": "TOPDOWN_RETIRING.ALL_P / INST_RETIRED.ANY", "MetricName": "tma_info_core_upi" }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l2h= it", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 2hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L3", - "MetricExpr": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3h= it", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3hit" - }, - { - "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss subsequently misses in the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.LLC_MISS / MEM_BOUND_= STALLS_IFETCH.ALL", - "MetricName": "tma_info_ifetch_miss_bound_%_ifetchmissbound_with_l= 3miss" - }, { "BriefDescription": "Percentage of ifetch miss bound stalls, where= the ifetch miss hits in the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_IFETCH.L2_HIT / MEM_BOUND_ST= ALLS_IFETCH.ALL", @@ -591,24 +550,6 @@ "MetricName": "tma_info_ifetch_miss_bound_ifetchmissbound_with_l3h= it", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l2hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l2hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L3", - "MetricExpr": "tma_info_load_miss_bound_loadmissbound_with_l3hit", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3hit" - }, - { - "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled 
due to an L1 miss that subsequently misses the L3", - "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.LLC_MISS / MEM_BOUND_ST= ALLS_LOAD.ALL", - "MetricGroup": "load_store_bound", - "MetricName": "tma_info_load_miss_bound_%_loadmissbound_with_l3mis= s" - }, { "BriefDescription": "Percentage of memory bound stalls where retir= ement is stalled due to an L1 miss that hit the L2", "MetricExpr": "100 * MEM_BOUND_STALLS_LOAD.L2_HIT / MEM_BOUND_STAL= LS_LOAD.ALL", @@ -656,16 +597,6 @@ "MetricExpr": "1e3 * MACHINE_CLEARS.SMC / INST_RETIRED.ANY", "MetricName": "tma_info_machine_clear_bound_machine_clears_smc_pki" }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_adressaliasing", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_adressaliasin= g" - }, - { - "BriefDescription": "Percentage of total non-speculative loads wit= h a store forward or unknown store address block", - "MetricExpr": "tma_info_mem_exec_blocks_loads_with_storefwdblk", - "MetricName": "tma_info_mem_exec_blocks_%_loads_with_storefwdblk" - }, { "BriefDescription": "Percentage of total non-speculative loads wit= h an address aliasing block", "MetricExpr": "100 * LD_BLOCKS.ADDRESS_ALIAS / MEM_UOPS_RETIRED.AL= L_LOADS", @@ -678,31 +609,6 @@ "MetricName": "tma_info_mem_exec_blocks_loads_with_storefwdblk", "ScaleUnit": "100%" }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_l1miss", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_l1miss" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to o= ther block cases, such as pipeline conflicts, fences, etc", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_otherpipeline= blks", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_otherpipeli= neblks" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= pagewalk", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_pagewalk", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_pagewalk" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= second level TLB miss", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_stlbhit", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_stlbhit" - }, - { - "BriefDescription": "Percentage of Memory Execution Bound due to a= store forward address match", - "MetricExpr": "tma_info_mem_exec_bound_loadhead_with_storefwding", - "MetricName": "tma_info_mem_exec_bound_%_loadhead_with_storefwding" - }, { "BriefDescription": "Percentage of Memory Execution Bound due to a= first level data cache miss", "MetricExpr": "100 * LD_HEAD.L1_MISS_AT_RET / LD_HEAD.ANY_AT_RET", @@ -758,11 +664,6 @@ "MetricExpr": "1e3 * MEM_UOPS_RETIRED.ALL_LOADS / TOPDOWN_RETIRING= .ALL_P", "MetricName": "tma_info_mem_mix_memload_ratio" }, - { - "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", - "MetricExpr": "tma_info_serialization_tpause_cycles", - "MetricName": "tma_info_serialization _%_tpause_cycles" - }, { "BriefDescription": "Percentage of time that the core is stalled d= ue to a TPAUSE or UMWAIT instruction", "MetricExpr": "100 * SERIALIZATION.C01_MS_SCB / (6 * CPU_CLK_UNHAL= TED.CORE)", @@ -783,14 +684,17 @@ }, { "BriefDescription": "Fraction of cycles spent in Kernel mode", - "MetricExpr": "cpu@CPU_CLK_UNHALTED.CORE_P@k / CPU_CLK_UNHALTED.CO= RE", - 
"MetricGroup": "Summary", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P:k / CPU_CLK_UNHALTED.CORE", "MetricName": "tma_info_system_kernel_utilization" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.CORE_P / CPU_CLK_UNHALTED.CORE", + "MetricName": "tma_info_system_mux" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "CPU_CLK_UNHALTED.CORE / CPU_CLK_UNHALTED.REF_TSC", - "MetricGroup": "Power", "MetricName": "tma_info_system_turbo_utilization" }, { @@ -814,102 +718,90 @@ "MetricName": "tma_info_uop_mix_x87_uop_ratio" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB= ) misses.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to Instruction Table Lookaside Buffer (ITLB= ) misses", "MetricExpr": "TOPDOWN_FE_BOUND.ITLB_MISS / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_latency_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_latency_gr= oup", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_ifetch_latency >= 0.15 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the total number of issue slots that w= ere not consumed by the backend because allocation is stalled due to a mach= ine clear (nuke) of any kind including memory ordering and memory disambigu= ation", "MetricExpr": "TOPDOWN_BAD_SPECULATION.MACHINE_CLEARS / (6 * CPU_C= LK_UNHALTED.CORE)", - "MetricGroup": "TopdownL2;tma_L2_group;tma_bad_speculation_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_bad_speculation_g= roup", "MetricName": "tma_machine_clears", - "MetricThreshold": "tma_machine_clears > 0.05 & tma_bad_speculatio= n > 0.15", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to memory reservation stalls in which a sched= uler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.MEM_SCHEDULER / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_mem_scheduler", - "MetricThreshold": "tma_mem_scheduler > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to IEC or FPC RAT stalls, which can be due to= FIQ or IEC reservation stalls in which the integer, floating point or SIMD= scheduler is not able to accept uops", "MetricExpr": "TOPDOWN_BE_BOUND.NON_MEM_SCHEDULER / (6 * CPU_CLK_U= NHALTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_non_mem_scheduler", - "MetricThreshold": "tma_non_mem_scheduler > 0.1 & (tma_resource_bo= und > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to a machine clear that requires the use of m= icrocode (slow nuke)", "MetricExpr": "TOPDOWN_BAD_SPECULATION.NUKE / (6 * CPU_CLK_UNHALTE= D.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_machine_clears_group", + "MetricGroup": 
"Slots;TopdownL3;tma_L3_group;tma_machine_clears_gr= oup", "MetricName": "tma_nuke", - "MetricThreshold": "tma_nuke > 0.05 & (tma_machine_clears > 0.05 &= tma_bad_speculation > 0.15)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to other common frontend stalls not categor= ized", "MetricExpr": "TOPDOWN_FE_BOUND.OTHER / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_other_fb", - "MetricThreshold": "tma_other_fb > 0.05 & (tma_ifetch_bandwidth > = 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { - "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes.", + "BriefDescription": "Counts the number of issue slots that were no= t delivered by the frontend due to wrong predecodes", "MetricExpr": "TOPDOWN_FE_BOUND.PREDECODE / (6 * CPU_CLK_UNHALTED.= CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_ifetch_bandwidth_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_ifetch_bandwidth_= group", "MetricName": "tma_predecode", - "MetricThreshold": "tma_predecode > 0.05 & (tma_ifetch_bandwidth >= 0.1 & tma_frontend_bound > 0.2)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the physical register file unable to accep= t an entry (marble stalls)", "MetricExpr": "TOPDOWN_BE_BOUND.REGISTER / (6 * CPU_CLK_UNHALTED.C= ORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_register", - "MetricThreshold": "tma_register > 0.1 & (tma_resource_bound > 0.2= & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by the backend due to the reorder buffer being full (ROB stalls)= ", "MetricExpr": "TOPDOWN_BE_BOUND.REORDER_BUFFER / (6 * CPU_CLK_UNHA= LTED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_reorder_buffer", - "MetricThreshold": "tma_reorder_buffer > 0.1 & (tma_resource_bound= > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of cycles the core is stall= ed due to a resource limitation", "MetricExpr": "tma_backend_bound - tma_core_bound", - "MetricGroup": "TopdownL2;tma_L2_group;tma_backend_bound_group", + "MetricGroup": "Slots;TopdownL2;tma_L2_group;tma_backend_bound_gro= up", "MetricName": "tma_resource_bound", - "MetricThreshold": "tma_resource_bound > 0.2 & tma_backend_bound >= 0.1", "MetricgroupNoGroup": "TopdownL2", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that result = in retirement slots", "MetricExpr": "TOPDOWN_RETIRING.ALL_P / (6 * CPU_CLK_UNHALTED.CORE= )", - "MetricGroup": "TopdownL1;tma_L1_group", + "MetricGroup": "Slots;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", - "MetricThreshold": "tma_retiring > 0.75", "MetricgroupNoGroup": "TopdownL1", "ScaleUnit": "100%" }, { "BriefDescription": "Counts the number of issue slots that were no= t consumed by 
the backend due to scoreboards from the instruction queue (IQ= ), jump execution unit (JEU), or microcode sequencer (MS)", "MetricExpr": "TOPDOWN_BE_BOUND.SERIALIZATION / (6 * CPU_CLK_UNHAL= TED.CORE)", - "MetricGroup": "TopdownL3;tma_L3_group;tma_resource_bound_group", + "MetricGroup": "Slots;TopdownL3;tma_L3_group;tma_resource_bound_gr= oup", "MetricName": "tma_serialization", - "MetricThreshold": "tma_serialization > 0.1 & (tma_resource_bound = > 0.2 & tma_backend_bound > 0.1)", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json = b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json index f37107373e3b..4406b7f28c67 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-cache.json @@ -1066,7 +1066,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Code read from local IA that miss the cache", + "BriefDescription": "Code read from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_CRD", @@ -1086,7 +1086,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Data read opt from local IA that miss the cac= he", + "BriefDescription": "Data read opt from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_DRD_OPT", @@ -1096,7 +1096,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Data read opt prefetch from local IA that mis= s the cache", + "BriefDescription": "Data read opt prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_DRD_OPT_PREF", @@ -1400,7 +1400,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_LOCAL", "PerPkg": "1", - "PublicDescription": "TOR Inserts : DRd_Opt_Prefs issued by iA Cor= es that missed the LLC", + "PublicDescription": "TOR Inserts : Data read opt prefetch from lo= cal iA that missed the LLC targeting local memory", "UMask": "0xc8a6fe01", "Unit": "CHA" }, @@ -1410,7 +1410,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_PREF_REMOTE", "PerPkg": "1", - "PublicDescription": "TOR Inserts : DRd_Opt_Prefs issued by iA Cor= es that missed the LLC", + "PublicDescription": "TOR Inserts : Data read opt prefetch from lo= cal iA that missed the LLC targeting remote memory", "UMask": "0xc8a77e01", "Unit": "CHA" }, @@ -1420,7 +1420,7 @@ "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD_OPT_REMOTE", "PerPkg": "1", - "PublicDescription": "TOR Inserts : DRd_Opt issued by iA Cores tha= t missed the LLC", + "PublicDescription": "TOR Inserts : Data read opt from local iA th= at missed the LLC targeting remote memory", "UMask": "0xc8277e01", "Unit": "CHA" }, @@ -1595,7 +1595,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA that miss th= e LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_LOCAL", @@ -1624,7 +1624,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the LLC targeting local memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_LOCAL", @@ -1634,7 +1634,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read 
for ownership prefetch from local IA tha= t miss the LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_PREF_REMOTE", @@ -1644,7 +1644,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA that miss th= e LLC targeting remote memory", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO_REMOTE", @@ -1736,7 +1736,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership from local IA that miss th= e cache", + "BriefDescription": "Read for ownership from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO", @@ -1746,7 +1746,7 @@ "Unit": "CHA" }, { - "BriefDescription": "Read for ownership prefetch from local IA tha= t miss the cache", + "BriefDescription": "Read for ownership prefetch from local IA", "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_RFO_PREF", @@ -2442,7 +2442,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD_OPT", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opts issued by iA Cores = that hit the LLC", "UMask": "0xc827fd01", @@ -2453,7 +2452,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD_OPT_PREF", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opt_Prefs issued by iA C= ores that hit the LLC", "UMask": "0xc8a7fd01", @@ -2663,7 +2661,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_OPT", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opt issued by iA Cores t= hat missed the LLC", "UMask": "0xc827fe01", @@ -2674,7 +2671,6 @@ "Counter": "0", "EventCode": "0x36", "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD_OPT_PREF", - "Experimental": "1", "PerPkg": "1", "PublicDescription": "TOR Occupancy : DRd_Opt_Prefs issued by iA C= ores that missed the LLC", "UMask": "0xc8a7fe01", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnec= t.json b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnect.js= on index 80440edac431..2ccbc8bca24e 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-interconnect.json @@ -814,6 +814,26 @@ "PerPkg": "1", "Unit": "IRP" }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Rejects", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REJ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x2", + "Unit": "IRP" + }, + { + "BriefDescription": "Counts Timeouts - Set 0 : Fastpath Requests", + "Counter": "0,1,2,3", + "EventCode": "0x1E", + "EventName": "UNC_I_MISC0.FAST_REQ", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x1", + "Unit": "IRP" + }, { "BriefDescription": "Misc Events - Set 1 : Lost Forward : Snoop pu= lled away ownership before a write was committed", "Counter": "0,1,2,3", @@ -824,6 +844,46 @@ "UMask": "0x10", "Unit": "IRP" }, + { + "BriefDescription": "Snoop Hit E/S responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_ES", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x74", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit I responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": 
"UNC_I_SNOOP_RESP.ALL_HIT_I", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x72", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop Hit M responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_HIT_M", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x78", + "Unit": "IRP" + }, + { + "BriefDescription": "Snoop miss responses", + "Counter": "0,1,2,3", + "EventCode": "0x12", + "EventName": "UNC_I_SNOOP_RESP.ALL_MISS", + "Experimental": "1", + "PerPkg": "1", + "UMask": "0x71", + "Unit": "IRP" + }, { "BriefDescription": "Inbound write (fast path) requests to coheren= t memory, received by the IRP resulting in write ownership requests issued = by IRP to the mesh.", "Counter": "0,1,2,3", @@ -1196,6 +1256,33 @@ "UMask": "0x4", "Unit": "UPI" }, + { + "BriefDescription": "Cycles in L0p", + "Counter": "0,1,2,3", + "EventCode": "0x27", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Counter": "0,1,2,3", + "EventCode": "0x28", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_LL_ENTER", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, + { + "BriefDescription": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Counter": "0,1,2,3", + "EventCode": "0x29", + "EventName": "UNC_UPI_TxL0P_POWER_CYCLES_M3_EXIT", + "Experimental": "1", + "PerPkg": "1", + "Unit": "UPI" + }, { "BriefDescription": "Matches on Transmit path of a UPI Port : Non-= Coherent Bypass", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json b/t= ools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json index cffb9d94b53d..886b99a971be 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-io.json @@ -17,7 +17,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -29,7 +29,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -41,7 +41,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -53,7 +53,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -65,7 +65,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -77,7 +77,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -89,7 +89,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -101,7 +101,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -113,7 +113,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -125,7 +125,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff0ff", + "UMask": "0xff", "Unit": "IIO" }, { @@ -137,7 +137,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -149,7 +149,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -161,7 +161,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - 
"UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -173,7 +173,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -185,7 +185,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -197,7 +197,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -209,7 +209,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -221,7 +221,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -233,7 +233,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -245,7 +245,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -257,7 +257,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -269,7 +269,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -281,7 +281,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -293,7 +293,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -305,7 +305,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -317,7 +317,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -329,7 +329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -341,7 +341,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -352,7 +352,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -363,7 +363,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -374,7 +374,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -385,7 +385,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -396,7 +396,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -407,7 +407,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -418,7 +418,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -429,7 +429,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -440,7 +440,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -451,7 +451,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -462,7 +462,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", 
"Unit": "IIO" }, { @@ -473,7 +473,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -484,7 +484,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -495,7 +495,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -506,7 +506,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -517,7 +517,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -528,7 +528,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -539,7 +539,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -550,7 +550,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -561,7 +561,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -572,7 +572,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -583,7 +583,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x02", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -594,7 +594,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x04", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -605,7 +605,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x08", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -616,7 +616,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x10", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -627,7 +627,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x20", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -638,7 +638,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x40", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -649,7 +649,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x80", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -661,7 +661,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -673,7 +673,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -685,7 +685,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -697,7 +697,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -709,7 +709,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -721,7 +721,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -733,7 +733,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -745,7 +745,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -757,7 +757,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -769,7 +769,7 @@ "FCMask": "0x07", 
"PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -781,7 +781,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -793,7 +793,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -805,7 +805,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -817,7 +817,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -829,7 +829,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -841,7 +841,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -853,7 +853,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1129,7 +1129,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1141,7 +1141,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1153,7 +1153,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1165,7 +1165,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1177,7 +1177,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1189,7 +1189,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1201,7 +1201,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x700f010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1213,7 +1213,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff080", + "UMask": "0x80", "Unit": "IIO" }, { @@ -1225,7 +1225,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff040", + "UMask": "0x40", "Unit": "IIO" }, { @@ -1237,7 +1237,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff020", + "UMask": "0x20", "Unit": "IIO" }, { @@ -1249,7 +1249,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1261,7 +1261,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1273,7 +1273,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1285,7 +1285,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff010", + "UMask": "0x10", "Unit": "IIO" }, { @@ -1297,7 +1297,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1318,7 +1318,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1329,7 +1329,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1340,7 +1340,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1351,7 +1351,7 @@ "FCMask": "0x07", 
"PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1362,7 +1362,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1373,7 +1373,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1384,7 +1384,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1395,7 +1395,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1406,7 +1406,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1417,7 +1417,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1428,7 +1428,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1439,7 +1439,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1450,7 +1450,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1461,7 +1461,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1472,7 +1472,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1483,7 +1483,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1494,7 +1494,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1505,7 +1505,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1516,7 +1516,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1527,7 +1527,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x0FF", - "UMask": "0x70ff002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1538,7 +1538,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1549,7 +1549,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1560,7 +1560,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1571,7 +1571,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1582,7 +1582,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1593,7 +1593,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1604,7 +1604,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1615,7 +1615,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080004", + "UMask": "0x4", "Unit": "IIO" }, { @@ -1626,7 +1626,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1637,7 +1637,7 @@ "FCMask": 
"0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1648,7 +1648,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1659,7 +1659,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1670,7 +1670,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1681,7 +1681,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1692,7 +1692,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1703,7 +1703,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080001", + "UMask": "0x1", "Unit": "IIO" }, { @@ -1715,7 +1715,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1727,7 +1727,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1739,7 +1739,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1751,7 +1751,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1763,7 +1763,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1775,7 +1775,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1787,7 +1787,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1799,7 +1799,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080008", + "UMask": "0x8", "Unit": "IIO" }, { @@ -1811,7 +1811,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x001", - "UMask": "0x7001002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1823,7 +1823,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x002", - "UMask": "0x7002002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1835,7 +1835,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x004", - "UMask": "0x7004002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1847,7 +1847,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x008", - "UMask": "0x7008002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1859,7 +1859,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x010", - "UMask": "0x7010002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1871,7 +1871,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x020", - "UMask": "0x7020002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1883,7 +1883,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x040", - "UMask": "0x7040002", + "UMask": "0x2", "Unit": "IIO" }, { @@ -1895,7 +1895,7 @@ "FCMask": "0x07", "PerPkg": "1", "PortMask": "0x080", - "UMask": "0x7080002", + "UMask": "0x2", "Unit": "IIO" } ] diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json= b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json index 7e6e6764f181..ae9c62b32e92 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-memory.json @@ -169,7 +169,7 @@ "Unit": "IMC" }, { - "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is 
enabled", + "BriefDescription": "Number of DRAM DCLK clock cycles while the ev= ent is enabled. DCLK is 1/4 of DRAM data rate.", "Counter": "0,1,2,3", "EventCode": "0x01", "EventName": "UNC_M_CLOCKTICKS", @@ -188,6 +188,104 @@ "PublicDescription": "DRAM Clockticks", "Unit": "IMC" }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x4", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH0_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x8", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK0", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x10", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK1", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x20", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK2", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x40", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e", + "Counter": "0,1,2,3", + "EventCode": "0x47", + "EventName": "UNC_M_POWERDOWN_CYCLES.SCH1_RANK3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x80", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles a given rank is in Power Down Mod= e and all pages are closed", + "Counter": "0,1,2,3", + "EventCode": "0x88", + "EventName": "UNC_M_POWER_CHANNEL_PPD_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "Unit": "IMC" + }, { "BriefDescription": "DRAM Precharge commands. 
: Counts the number = of DRAM Precharge commands sent on this channel.", "Counter": "0,1,2,3", @@ -360,6 +458,28 @@ "PerPkg": "1", "Unit": "IMC" }, + { + "BriefDescription": "subevent0 - # of cycles all ranks were in SR = subevent1 - # of times all ranks went into SR subevent2 -# of times ps_sr_= active asserted (SRE) subevent3 - # of times ps_sr_active deasserted (SRX) = subevent4 - # of times PS-&>Refresh ps_sr_req asserted (SRE) subevent5 - # = of times PS-&>Refresh ps_sr_req deasserted (SRX) subevent6 - # of cycles PS= Ctrlr FSM was in FATAL", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "UNC_M_SELF_REFRESH.ENTER_SUCCESS", + "UMask": "0x2", + "Unit": "IMC" + }, + { + "BriefDescription": "# of cycles all ranks were in SR", + "Counter": "0,1,2,3", + "EventCode": "0x43", + "EventName": "UNC_M_SELF_REFRESH.ENTER_SUCCESS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "-", + "UMask": "0x1", + "Unit": "IMC" + }, { "BriefDescription": "Write Pending Queue Allocations", "Counter": "0,1,2,3", diff --git a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json = b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json index 02e59f64a544..9ea852ef190e 100644 --- a/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json +++ b/tools/perf/pmu-events/arch/x86/sierraforest/uncore-power.json @@ -7,5 +7,103 @@ "PerPkg": "1", "PublicDescription": "PCU Clockticks: The PCU runs off a fixed 1 = GHz clock. This event counts the number of pclk cycles measured while the = counter was enabled. The pclk, like the Memory Controller's dclk, counts a= t a constant rate making it a good measure of actual wall time.", "Unit": "PCU" + }, + { + "BriefDescription": "Thermal Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x04", + "EventName": "UNC_P_FREQ_MAX_LIMIT_THERMAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Thermal Strongest Upper Limit Cycles : Numbe= r of cycles any frequency is reduced due to a thermal limit. Count only if= throttling is occurring.", + "Unit": "PCU" + }, + { + "BriefDescription": "Power Strongest Upper Limit Cycles", + "Counter": "0,1,2,3", + "EventCode": "0x05", + "EventName": "UNC_P_FREQ_MAX_POWER_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Power Strongest Upper Limit Cycles : Counts = the number of cycles when power is the upper limit on frequency.", + "Unit": "PCU" + }, + { + "BriefDescription": "Cycles spent changing Frequency", + "Counter": "0,1,2,3", + "EventCode": "0x74", + "EventName": "UNC_P_FREQ_TRANS_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Cycles spent changing Frequency : Counts the= number of cycles when the system is changing frequency. This can not be f= iltered by thread ID. One can also use it with the occupancy counter that = monitors number of threads in C0 to estimate the performance impact that fr= equency transitions had on the system.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C2E", + "Counter": "0,1,2,3", + "EventCode": "0x2b", + "EventName": "UNC_P_PKG_RESIDENCY_C2E_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C2E : Counts the= number of cycles when the package was in C2E. 
This event can be used in c= onjunction with edge detect to count C2E entrances (or exits using invert).= Residency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Package C State Residency - C6", + "Counter": "0,1,2,3", + "EventCode": "0x2d", + "EventName": "UNC_P_PKG_RESIDENCY_C6_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Package C State Residency - C6 : Counts the = number of cycles when the package was in C6. This event can be used in con= junction with edge detect to count C6 entrances (or exits using invert). R= esidency events do not include transition times.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C0", + "Counter": "0,1,2,3", + "EventCode": "0x35", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C0", + "PerPkg": "1", + "PublicDescription": "Number of cores in C0 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C3", + "Counter": "0,1,2,3", + "EventCode": "0x36", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C3", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Number of cores in C3 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "Number of cores in C6", + "Counter": "0,1,2,3", + "EventCode": "0x37", + "EventName": "UNC_P_POWER_STATE_OCCUPANCY_CORES_C6", + "PerPkg": "1", + "PublicDescription": "Number of cores in C6 : This is an occupancy= event that tracks the number of cores that are in the chosen C-State. It = can be used by itself to get the average number of cores in that C-state wi= th thresholding to generate histograms, or with other PCU events and occupa= ncy triggering to capture other details.", + "Unit": "PCU" + }, + { + "BriefDescription": "External Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x0a", + "EventName": "UNC_P_PROCHOT_EXTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "External Prochot : Counts the number of cycl= es that we are in external PROCHOT mode. This mode is triggered when a sen= sor off the die determines that something off-die (like DRAM) is too hot an= d must throttle to avoid damaging the chip.", + "Unit": "PCU" + }, + { + "BriefDescription": "Internal Prochot", + "Counter": "0,1,2,3", + "EventCode": "0x09", + "EventName": "UNC_P_PROCHOT_INTERNAL_CYCLES", + "Experimental": "1", + "PerPkg": "1", + "PublicDescription": "Internal Prochot : Counts the number of cycl= es that we are in Internal PROCHOT mode. 
This mode is triggered when a sen= sor on the die determines that we are too hot and must throttle to avoid da= maging the chip.", + "Unit": "PCU" } ] --=20 2.48.0.rc2.279.g1de40edade-goog From nobody Sun Dec 14 13:49:51 2025 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7F8D719FA92 for ; Thu, 16 Jan 2025 06:45:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737009932; cv=none; b=O7UgblGMIfvqpF6JY5XfeCz58IoAWg+SwtN1dryKGtHBfnPXoM6HDy88fW+EzWmjxQYHq0nKDAOPMqZJsG6s4NgOCDTgOWwU07wpRMexAf8McHEOQb30FNvqJh5lF7Gj/mVnL+YHXnTJylXcIz8/3wMg40qN3Hl9I0VaixukZZ4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737009932; c=relaxed/simple; bh=i3yCzm6ysagwRJ3JCphciMQofdK6j5RAffaFOBSnMMg=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=DVBurr3coffHm4pqxVgMVkPNn6iLZVpzYhvHc2jZUmK/SeJiI13AMIvjZVxWjrESPSHGsTWR2oY78IU2oInFDYMHlPEIlFkJFoTjGaBTYd/+W5hh+GTSIKpqXhnvQso9qH0clauj7G3V6cend6+2wmVjuVrkAAV4sq9R/56b+3k= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=QAaVt2n+; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="QAaVt2n+" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e3c61a11a40so1631343276.2 for ; Wed, 15 Jan 2025 22:45:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1737009900; x=1737614700; darn=vger.kernel.org; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=d0H04d4YjG1tVMLc812AcrISyFq5JcdvK7C+xWCQv1U=; b=QAaVt2n+1AbeUyzbX4HyEcM/YcGShQ9uz7jbHRrfLRlM6jSs8d+UxMDvsMpFx/hLkP wA80HkWMwV7BGs4+rfPcBU7V/YcMT3xkDYUA9xP/esF0iCb2nXJUwYW+4rUNzFN6CGiK VrC9Zat66i191Bz8j/UJ96zW/yUawLJCccXJTaxEBOHpi9kTAwgGma7nW9MvqVD6M+WP zWo32qRqbzVInccesFvrLDeBDumu64r9fmaQ3c6hgLv+ye+d6Xy+q395eUhGwEXeVMcO Guum3NQOa0DXIN2vvLhqtLQOQf4F2yfgoWtHrsxHxi7y6Qs83UgrGkNg8CtxnL1EOvJX MbUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737009900; x=1737614700; h=content-transfer-encoding:to:from:subject:references:mime-version :message-id:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=d0H04d4YjG1tVMLc812AcrISyFq5JcdvK7C+xWCQv1U=; b=DgX94hNEPgW0V+ZqlrbKl7X9TX3inBp81ozosEhlu1pDxbGyNN9RYSbZTJJeBkOzXE c+VSOHBwxoW0xcGcy8zq8SrJWLxCXDhaBALoxAWVK+c5TqNmOj5wto49mcXjnR5J7VqI rzivnYBCEESuJAlymv3RbCF9npbNjvhOSNk0h8vtrRXz5jfopCuUvL6gY+FafkgVS0D0 DSqpMKQfVfGlaoyvkvvmqhVcj69C4gM4U/sTGLYOkaY9+UMSqQeuJxLg/heRiCvdMG28 UOywYfdEeEyj5+4AIWAv6V1wQA9f2xou7R8Y4qCJi3m/NAjYL74V2/RpFwcr48MwFf8G jvhQ== X-Forwarded-Encrypted: i=1; 
Date: Wed, 15 Jan 2025 22:43:53 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-22-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 21/23] perf vendor events: Update Skylake metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan

Update TMA metrics from 4.8 to 5.01.

The TMA 5.01 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 .../arch/x86/skylake/metricgroups.json        |   9 +-
 .../arch/x86/skylake/skl-metrics.json         | 684 ++++++++++--------
 2 files changed, 406 insertions(+), 287 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json b/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json
index 3a88260194d1..00172bff5219 100644
--- a/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/skylake/metricgroups.json
@@ -37,6 +37,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -83,7 +84,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
"tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -112,10 +115,13 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", + "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category", "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -128,5 +134,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json b/tool= s/perf/pmu-events/arch/x86/skylake/skl-metrics.json index 4e954fe8547c..2a76dd01fb52 100644 --- a/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json +++ b/tools/perf/pmu-events/arch/x86/skylake/skl-metrics.json @@ -74,12 +74,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", "MetricName": "tma_alu_op_utilization", @@ -91,7 +91,7 @@ "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info= _thread_slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", "ScaleUnit": "100%" }, @@ -102,7 +102,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -112,9 +112,102 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. 
This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, + { + "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", + "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses = + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)", + "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", + "MetricName": "tma_bottleneck_big_code", + "MetricThreshold": "tma_bottleneck_big_code > 20" + }, + { + "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", + "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", + "MetricGroup": "BvBO;Ret", + "MetricName": "tma_bottleneck_branching_overhead", + "MetricThreshold": "tma_bottleneck_branching_overhead > 5", + "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. (A lower bound)" + }, + { + "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", + "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1= _bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dr= am_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound += tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tm= a_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_alias= ing + tma_fb_full)))", + "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", + "MetricName": "tma_bottleneck_cache_memory_bandwidth", + "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20", + "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. 
Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
+},
+{
+    "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+    "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+    "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+    "MetricName": "tma_bottleneck_cache_memory_latency",
+    "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+    "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
+},
+{
+    "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+    "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+    "MetricGroup": "BvCB;Cor;tma_issueComp",
+    "MetricName": "tma_bottleneck_compute_bound_est",
+    "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+    "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+},
+{
+    "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+    "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code",
+    "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+    "MetricName": "tma_bottleneck_instruction_fetch_bw",
+    "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
+},
+{
+    "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+    "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+    "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+    "MetricName": "tma_bottleneck_irregular_overhead",
+    "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+    "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
+},
+{
+    "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+    "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+    "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+    "MetricName": "tma_bottleneck_memory_data_tlbs",
+    "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+    "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store"
+},
+{
+    "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+    "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+    "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+    "MetricName": "tma_bottleneck_memory_synchronization",
+    "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+    "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears"
+},
+{
+    "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+    "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+    "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+    "MetricName": "tma_bottleneck_mispredictions",
+    "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+    "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
+},
+{
+    "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+    "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+    "MetricGroup": "BvOB;Cor;Offcore",
+    "MetricName": "tma_bottleneck_other_bottlenecks",
+    "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+    "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+},
+{
+    "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+    "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+    "MetricGroup": "BvUW;Ret",
+    "MetricName": "tma_bottleneck_useful_work",
+    "MetricThreshold": "tma_bottleneck_useful_work > 20"
+},
 {
     "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
     "MetricConstraint": "NO_GROUP_EVENTS",
@@ -123,7 +216,7 @@
     "MetricName": "tma_branch_mispredicts",
     "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
     "MetricgroupNoGroup": "TopdownL2",
-    "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+    "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
     "ScaleUnit": "100%"
 },
 {
@@ -131,8 +224,8 @@
     "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
     "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
     "MetricName": "tma_branch_resteers",
-    "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-    "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+    "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+    "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
     "ScaleUnit": "100%"
 },
 {
@@ -140,8 +233,8 @@
     "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
     "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
     "MetricName": "tma_cisc",
-    "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-    "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+    "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+    "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
     "ScaleUnit": "100%"
 },
 {
@@ -149,18 +242,50 @@
     "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
     "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
     "MetricName": "tma_clears_resteers",
-    "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+    "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
     "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
     "ScaleUnit": "100%"
 },
+{
+    "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instructions fetches, that later on hit in second-level TLB (STLB)",
+    "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)",
+    "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+    "MetricName": "tma_code_stlb_hit",
+    "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+    "ScaleUnit": "100%"
+},
+{
+    "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+    "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+    "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+    "MetricName": "tma_code_stlb_miss",
+    "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+    "ScaleUnit": "100%"
+},
+{
+    "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+    "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+    "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+    "MetricName": "tma_code_stlb_miss_2m",
+    "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+    "ScaleUnit": "100%"
+},
+{
+    "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+    "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+    "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+    "MetricName": "tma_code_stlb_miss_4k",
+    "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+    "ScaleUnit": "100%"
+},
 {
     "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
     "MetricConstraint": "NO_GROUP_EVENTS",
-    "MetricExpr": "(18.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + 16.5 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
-    "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+    "MetricExpr": "((22 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM + (20 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+    "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
     "MetricName": "tma_contested_accesses",
"MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related= metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_fals= e_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -171,25 +296,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. 
FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "16.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _HIT_RETIRED.XSNP_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_= MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(20 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT * (1 + MEM_LOA= D_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchron= ization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -198,7 +323,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -208,8 +333,8 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -218,7 +343,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -226,47 +351,47 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. 
The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. 
Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "22 * tma_info_system_core_frequency * OFFCORE_RESPO= NSE.DEMAND_RFO.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. 
False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related m= etrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_= data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy", "ScaleUnit": "100%" }, { @@ -276,7 +401,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. 
Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
     "ScaleUnit": "100%"
 },
 {
@@ -286,17 +411,17 @@
     "MetricName": "tma_fetch_latency",
     "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
     "MetricgroupNoGroup": "TopdownL2",
-    "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+    "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
     "ScaleUnit": "100%"
 },
 {
-    "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops",
+    "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops",
     "MetricConstraint": "NO_GROUP_EVENTS_NMI",
     "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
     "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
     "MetricName": "tma_few_uops_instructions",
     "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-    "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+    "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that that are decoder into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
     "ScaleUnit": "100%"
 },
 {
@@ -306,7 +431,7 @@
     "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
     "MetricName": "tma_fp_arith",
     "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-    "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+    "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
     "ScaleUnit": "100%"
 },
 {
@@ -315,7 +440,7 @@
     "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
     "MetricName": "tma_fp_assists",
     "MetricThreshold": "tma_fp_assists > 0.1",
-    "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+    "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handing Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
     "ScaleUnit": "100%"
 },
 {
@@ -323,8 +448,8 @@
     "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_SLOTS",
     "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
     "MetricName": "tma_fp_scalar",
-    "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-    "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+    "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+    "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
     "ScaleUnit": "100%"
 },
 {
@@ -333,8 +458,8 @@
     "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_SLOTS",
     "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
     "MetricName": "tma_fp_vector",
-    "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
-    "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+    "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+    "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
     "ScaleUnit": "100%"
 },
 {
@@ -342,8 +467,8 @@
     "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
     "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
     "MetricName": "tma_fp_vector_128b",
-    "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-    "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+    "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+    "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
     "ScaleUnit": "100%"
 },
 {
@@ -351,8 +476,8 @@
     "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
     "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
     "MetricName": "tma_fp_vector_256b",
-    "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-    "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+    "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+    "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
     "ScaleUnit": "100%"
 },
 {
@@ -362,50 +487,50 @@
     "MetricName": "tma_frontend_bound",
     "MetricThreshold": "tma_frontend_bound > 0.15",
     "MetricgroupNoGroup": "TopdownL1",
-    "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
+    "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4",
     "ScaleUnit": "100%"
 },
 {
-    "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions",
+    "BriefDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions",
     "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / UOPS_RETIRED.RETIRE_SLOTS",
     "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
     "MetricName": "tma_fused_instructions",
     "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_operations > 0.6",
-    "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions -- where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
+    "PublicDescription": "This metric represents fraction of slots where the CPU was retiring fused instructions , where one uop can represent multiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Other_Light_Ops in MTL!)}",
     "ScaleUnit": "100%"
 },
 {
-    "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+    "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences",
     "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUSED - INST_RETIRED.ANY) / tma_info_thread_slots",
     "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
     "MetricName": "tma_heavy_operations",
     "MetricThreshold": "tma_heavy_operations > 0.1",
     "MetricgroupNoGroup": "TopdownL2",
-    "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. ([ICL+] Note this may overcount due to approximation using indirect events; [ADL+] .)",
+    "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations , instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences.([ICL+] Note this may overcount due to approximation using indirect events; [ADL+])",
     "ScaleUnit": "100%"
 },
 {
     "BriefDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses",
-    "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=1\\,edge@) / tma_info_thread_clks",
+    "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDATA_STALL\\,cmask\\=0x1\\,edge\\=0x1@) / tma_info_thread_clks",
     "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3_group;tma_fetch_latency_group",
     "MetricName": "tma_icache_misses",
-    "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-    "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS",
+    "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+    "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RETIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS",
     "ScaleUnit": "100%"
 },
 {
-    "BriefDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
-    "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_thread_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100",
+    "BriefDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear)",
+    "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slots / 4 / BR_MISP_RETIRED.ALL_BRANCHES / 100",
     "MetricGroup": "Bad;BrMispredicts;tma_issueBM",
     "MetricName": "tma_info_bad_spec_branch_misprediction_cost",
-    "PublicDescription": "Branch Misprediction Cost: Fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear). Related metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers"
+    "PublicDescription": "Branch Misprediction Cost: Cycles representing fraction of TMA slots wasted per non-speculative branch misprediction (retired JEClear).
([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D1\\,edge@) / tma_info_thread_clks", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D0x1\\,edge\\=3D0x1@) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 4 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -415,7 +540,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -430,8 +555,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= )))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -439,7 +564,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -447,108 +572,10 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_hit_= latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_laten= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 - tma_o= ther_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -577,7 +604,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -605,11 +632,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -622,20 +649,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@ + 2", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -665,7 +692,13 @@ "MetricName": "tma_info_frontend_l2mpki_code_all" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -684,7 +717,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -692,7 +725,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -700,7 +733,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -708,7 +741,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -716,7 +749,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -756,7 +789,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -766,7 +799,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -801,7 +834,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -819,7 +852,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -861,13 +894,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -886,7 +919,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -942,7 +975,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -963,18 +996,18 @@ "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.A= NY)", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - 
"MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -992,15 +1025,15 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. 
A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -1010,13 +1043,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1027,18 +1061,31 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUP= ANCY.DATA_READ@cmask\\=3D1@", + "MetricExpr": "UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TRK_OCCUP= ANCY.DATA_READ@cmask\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TR= K_REQUESTS.DATA_READ) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_ARB_TRK_OCCUPANCY.DATA_READ / UNC_ARB_TR= K_REQUESTS.DATA_READ) / (tma_info_system_socket_clks / tma_info_system_time= )", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -1051,6 +1098,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1058,7 +1112,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1067,14 +1121,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1100,43 +1155,52 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_thread_clks)", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1144,17 +1208,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "6.5 * tma_info_system_core_frequency * (MEM_LOAD_RE= TIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)= ) / tma_info_thread_clks", + "MetricExpr": "(10 * tma_info_system_core_frequency - 3.5 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1162,18 +1226,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1191,7 +1255,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1199,15 +1263,39 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (9 = * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAND= 
ING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1219,16 +1307,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. 
T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1236,8 +1324,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1248,11 +1336,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1266,7 +1354,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1274,8 +1362,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1288,12 +1376,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1301,8 +1389,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. 
Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1311,7 +1399,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%" }, { @@ -1319,7 +1407,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1333,19 +1421,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1354,7 +1442,7 @@ "MetricGroup": 
"Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1363,7 +1451,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", "ScaleUnit": "100%" }, { @@ -1399,7 +1487,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1408,7 +1496,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1425,8 +1513,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1434,8 +1522,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1443,7 +1531,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1452,16 +1540,16 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1479,7 +1567,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1489,8 +1577,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1498,17 +1586,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. 
Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1516,8 +1604,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1525,18 +1613,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. 
To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_INST_RETIRED.LOCK_= LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / M= EM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). 
Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1552,7 +1640,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1560,7 +1648,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1568,7 +1680,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": 
"tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1577,8 +1689,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { --=20 2.48.0.rc2.279.g1de40edade-goog From nobody Sun Dec 14 13:49:51 2025 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E44271A4F09 for ; Thu, 16 Jan 2025 06:45:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737009937; cv=none; b=QSVNEoYWBzT9k1CiKvz6l/qaHrpDdYTDMxfKfqvVoSCFcmYiNiH9f9wWeg2HC/VS1nNDhb2jmB8a7WPN9yjnOR15cXbilHuRS4371UGmNWMWREie1DA6qz0q4f95RKJ3EmZPtYyQiB0gN+jciRcmTTXIhcSlYqw0xnLYZf5laa8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737009937; c=relaxed/simple; bh=JM7aSwYjnoeDrG0iGwtyVaSvxiAdtGxIrE0t/EV1qyU=; h=Date:In-Reply-To:Message-Id:Mime-Version:References:Subject:From: To:Content-Type; b=eg4Z6qjlDuB4hM5T5jh/BrTF9UrMwuE9A/JMM8wydZIRXCp7UtYvTdBMA3Volkp7HyNBJPYy1Lj/nEzR9uRGgsVIDg+zVlIxFp0hnY9XPApuXW7t+bmL9Cv5fzMQcHraQgO7fOwEyVSoutRzZaQpJKjPMLh6QlgWPbp9o056/s0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=BLEc1YDD; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--irogers.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="BLEc1YDD" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e578b0d2afdso1669302276.3 for ; Wed, 15 Jan 2025 22:45:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; 
Date: Wed, 15 Jan 2025 22:43:54 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-23-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 22/23] perf vendor events: Update SkylakeX events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Update events from v1.35 to v1.36. Update TMA metrics from 4.8 to 5.01.
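A quick way to sanity-check the result (an illustrative sketch, assuming a perf binary rebuilt from a tree containing these JSON files and run on a SkylakeX machine) is to confirm that the new LockCont metric group and one of its metrics are picked up:

  $ perf list metricgroup | grep LockCont
  $ perf stat -M tma_lock_latency -- sleep 1

With a ScaleUnit of "100%", perf stat reports tma_lock_latency as a percentage of clocks rather than a raw ratio.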
Bring in the event updates v1.36:
https://github.com/intel/perfmon/commit/f6801e5c145406f355f40e1746f836eaa1426cf9
The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 tools/perf/pmu-events/arch/x86/mapfile.csv    |   2 +-
 .../arch/x86/skylakex/metricgroups.json       |   9 +-
 .../arch/x86/skylakex/skx-metrics.json        | 740 ++++++++++--------
 .../arch/x86/skylakex/uncore-cache.json       |  60 +-
 .../x86/skylakex/uncore-interconnect.json     |  11 -
 5 files changed, 465 insertions(+), 357 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 344ca1873553..20c0a434de46 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -32,7 +32,7 @@ GenuineIntel-6-8F,v1.25,sapphirerapids,core
 GenuineIntel-6-AF,v1.07,sierraforest,core
 GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core
 GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core
-GenuineIntel-6-55-[01234],v1.35,skylakex,core
+GenuineIntel-6-55-[01234],v1.36,skylakex,core
 GenuineIntel-6-86,v1.23,snowridgex,core
 GenuineIntel-6-8[CD],v1.16,tigerlake,core
 GenuineIntel-6-2C,v5,westmereep-dp,core
diff --git a/tools/perf/pmu-events/arch/x86/skylakex/metricgroups.json b/tools/perf/pmu-events/arch/x86/skylakex/metricgroups.json
index cccfcab3425e..a579603f720b 100644
--- a/tools/perf/pmu-events/arch/x86/skylakex/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/skylakex/metricgroups.json
@@ -38,6 +38,7 @@
     "IoBW": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -84,7 +85,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -113,10 +116,13 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics
contributing to tma_light_opera= tions category", "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_= utilization category", + "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_mis= s category", "tma_machine_clears_group": "Metrics contributing to tma_machine_clear= s category", "tma_mem_latency_group": "Metrics contributing to tma_mem_latency cate= gory", "tma_memory_bound_group": "Metrics contributing to tma_memory_bound ca= tegory", @@ -129,5 +135,6 @@ "tma_retiring_group": "Metrics contributing to tma_retiring category", "tma_serializing_operation_group": "Metrics contributing to tma_serial= izing_operation category", "tma_store_bound_group": "Metrics contributing to tma_store_bound cate= gory", - "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category" + "tma_store_op_utilization_group": "Metrics contributing to tma_store_o= p_utilization category", + "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_m= iss category" } diff --git a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json b/too= ls/perf/pmu-events/arch/x86/skylakex/skx-metrics.json index e5e86892d7bb..2fe630cd4927 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/skx-metrics.json @@ -55,7 +55,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -76,31 +76,31 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte page sizes) caused by demand data loads to the total number of c= ompleted instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRE= D.ANY", "MetricName": "dtlb_2mb_large_page_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte page sizes) caused by demand data loads to the total number of = completed instructions. This implies it missed in the Data Translation Look= aside Buffer (DTLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. 
This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART0 + UNC_IIO= _PAYLOAD_BYTES_IN.MEM_WRITE.PART1 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART= 2 + UNC_IIO_PAYLOAD_BYTES_IN.MEM_WRITE.PART3) * 4 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" @@ -109,14 +109,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -192,25 +192,25 @@ "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_LOCAL * 64 / 1e6 / duration_= time", "MetricName": "llc_miss_local_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to local memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_LOCAL * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_local_memory_bandwidth_write", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of read requests that miss= the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.READS_REMOTE * 64 / 1e6 / duration= _time", "MetricName": "llc_miss_remote_memory_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory.", + "BriefDescription": "Bandwidth (MB/sec) of write requests that mis= s the last level cache (LLC) and go to remote memory", "MetricExpr": "UNC_CHA_REQUESTS.WRITES_REMOTE * 64 / 1e6 / duratio= n_time", "MetricName": "llc_miss_remote_memory_bandwidth_write", "ScaleUnit": "1MB/s" @@ -240,13 +240,13 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x404= 32@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x40432@ + cha@UNC_CHA= _TOR_INSERTS.IA_MISS\\,config1\\=3D0x40431@)", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x404= 31@ / (cha@UNC_CHA_TOR_INSERTS.IA_MISS\\,config1\\=3D0x40432@ + cha@UNC_CHA= _TOR_INSERTS.IA_MISS\\,config1\\=3D0x40431@)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -295,12 +295,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= 
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. A false match is possible; which incurs a few cycles of load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT.PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_info_thread_slots",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
@@ -312,7 +312,7 @@
         "MetricExpr": "34 * (FP_ASSIST.ANY + OTHER_ASSISTS.ANY) / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
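
As an aside on what tma_4k_aliasing above is counting (LD_BLOCKS_PARTIAL.ADDRESS_ALIAS): a load that follows a store whose address matches in the low 12 bits may be treated as dependent and re-issued. A minimal C sketch of the pattern, illustrative only and not part of the patch:

#include <stdint.h>

/* The store to buf[0] and the load from buf[512] are 4096 bytes apart,
 * so they share the same low 12 address bits and can falsely alias. */
uint64_t aliasing_sum(uint64_t *buf, int iters)
{
	uint64_t sum = 0;

	for (int i = 0; i < iters; i++) {
		buf[0] = (uint64_t)i;	/* store */
		sum += buf[512];	/* load at a 4 KiB offset */
	}
	return sum;
}
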
@@ -323,7 +323,7 @@
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound.",
+        "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound",
         "ScaleUnit": "100%"
     },
     {
@@ -333,9 +333,102 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This includes slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts)) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cache / (tma_local_mem + tma_remote_mem + tma_remote_cache) + tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
+        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20"
+    },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
         "MetricConstraint": "NO_GROUP_EVENTS",
@@ -344,7 +437,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
     {
@@ -352,8 +445,8 @@
         "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
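
The tma_bottleneck_* metrics added above are plain weighted sums over existing TMA tree nodes, scaled to a percentage. As a sketch of the arithmetic, the tma_bottleneck_big_code expression corresponds to the following C helper, where each parameter is the fraction reported by the named TMA node (illustrative only, not part of the patch):

/* 100 * tma_fetch_latency * (itlb + icache + unknown_branches) /
 *       (sum of the tma_fetch_latency children) */
static double bottleneck_big_code(double fetch_latency, double icache_misses,
				  double itlb_misses, double branch_resteers,
				  double ms_switches, double lcp,
				  double dsb_switches, double unknown_branches)
{
	double fetch_lat_children = icache_misses + itlb_misses +
		branch_resteers + ms_switches + lcp + dsb_switches;

	return 100.0 * fetch_latency *
		(itlb_misses + icache_misses + unknown_branches) /
		fetch_lat_children;
}
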
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -361,8 +454,8 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources",
         "ScaleUnit": "100%"
     },
     {
@@ -370,18 +463,50 @@
         "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks",
         "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_branch_resteers_group;tma_issueMC",
         "MetricName": "tma_clears_resteers",
-        "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))",
+        "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers as a result of Machine Clears. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_ms_switches",
         "ScaleUnit": "100%"
     },
+    {
+        "BriefDescription": "This metric roughly estimates the fraction of cycles where the (first level) ITLB was missed by instruction fetches, that later on hit in second-level TLB (STLB)",
+        "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_hit",
+        "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles where the Second-level TLB (STLB) was missed by instruction fetches, performing a hardware page walk",
+        "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb_misses_group",
+        "MetricName": "tma_code_stlb_miss",
+        "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 2 or 4 MB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_2m",
+        "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
+    {
+        "BriefDescription": "This metric estimates the fraction of cycles to walk the memory paging structures to cache translation of 4 KB pages for (instruction) code accesses",
+        "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)",
+        "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code_stlb_miss_group",
+        "MetricName": "tma_code_stlb_miss_4k",
+        "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_miss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "ScaleUnit": "100%"
+    },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "(44 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + 44 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
+        "MetricExpr": "((47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_contested_accesses",
-        "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to contested accesses. Contested accesses occur when data written by one Logical Processor are read by another Logical Processor on a different Physical Core. Examples of contested accesses include synchronizations such as locks; true data sharing such as modified locked variables; and false sharing. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
@@ -392,25 +517,25 @@
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
+        "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were a bottleneck. Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations)",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "44 * tma_info_system_core_frequency * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
+        "MetricExpr": "(47.5 * tma_info_system_core_frequency - 3.5 * tma_info_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM * (1 - OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE / (OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks",
         "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_l3_bound_group",
         "MetricName": "tma_data_sharing",
-        "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates fraction of cycles while the memory subsystem was handling synchronizations due to data-sharing accesses. Data shared by multiple Logical Processors (even just read shared) may cause increased access latency due to cache coherency. Excessive data sharing can drastically harm multithreaded performance. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder",
-        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=2@) / tma_info_core_core_clks / 2",
+        "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=0x1@ - cpu@INST_DECODED.DECODERS\\,cmask\\=0x2@) / tma_info_core_core_clks / 2",
         "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0;tma_mite_group",
         "MetricName": "tma_decoder0_alone",
-        "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & tma_fetch_bandwidth > 0.2)",
+        "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tma_fetch_bandwidth > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where decoder-0 was the only active decoder. Related metrics: tma_few_uops_instructions",
         "ScaleUnit": "100%"
     },
@@ -419,7 +544,7 @@
         "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks",
         "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group",
         "MetricName": "tma_divider",
-        "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2)",
+        "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma_backend_bound > 0.2",
         "PublicDescription": "This metric represents fraction of cycles where the Divider unit was active. Divide and square root instructions are performed by the Divider unit and can take considerably longer latency than integer or Floating Point addition; subtraction; or multiplication. Sample with: ARITH.DIVIDER_ACTIVE",
         "ScaleUnit": "100%"
     },
@@ -429,8 +554,8 @@
         "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clks + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_info_thread_clks - tma_l2_bound",
         "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_memory_bound_group",
         "MetricName": "tma_dram_bound",
-        "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)",
-        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS_PS",
+        "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often the CPU was stalled on accesses to external memory (DRAM) by loads. Better caching can improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED.L3_MISS",
         "ScaleUnit": "100%"
     },
     {
@@ -439,7 +564,7 @@
         "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandwidth_group",
         "MetricName": "tma_dsb",
         "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2",
-        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here.",
+        "PublicDescription": "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline. For example; inefficient utilization of the DSB cache structure or bank conflict when reading from it; are categorized here",
         "ScaleUnit": "100%"
     },
     {
@@ -447,47 +572,47 @@
         "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_clks",
         "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group;tma_issueFB",
         "MetricName": "tma_dsb_switches",
-        "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to switches from DSB to MITE pipelines. The DSB (decoded i-cache) is a Uop Cache where the front-end directly delivers Uops (micro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter latency and delivered higher bandwidth than the MITE (legacy instruction decode pipeline). Switching between the two pipelines can cause penalties hence this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DSB_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
+        "MetricExpr": "min(9 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_l1_bound_group",
         "MetricName": "tma_dtlb_load",
-        "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. Related metrics: tma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Translation Look-aside Buffers) are processor caches for recently used entries out of the Page Tables that are used to map virtual- to physical-addresses by the operating system. This metric approximates the potential delay of demand loads missing the first-level data TLB (assuming worst case scenario with back to back misses to different pages). This includes hitting in the second-level TLB (STLB) as well as performing a hardware page walk on an STLB miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_store",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses",
-        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
+        "MetricExpr": "(9 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=0x1@ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks",
         "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB;tma_store_bound_group",
         "MetricName": "tma_dtlb_store",
-        "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates the fraction of cycles spent handling first-level data TLB store misses. As with ordinary data caching; focus on improving data locality and reducing working-set size to reduce DTLB overhead. Additionally; consider using profile-guided optimization (PGO) to collocate frequently-used data on the same page. Try using larger page sizes for large amounts of frequently-used data. Sample with: MEM_INST_RETIRED.STLB_MISS_STORES. Related metrics: tma_bottleneck_memory_data_tlbs, tma_dtlb_load",
         "ScaleUnit": "100%"
     },
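
One way to act on the "larger page sizes" advice in the tma_dtlb_load/tma_dtlb_store descriptions above is transparent huge pages. A minimal, Linux-specific C sketch (best-effort, illustrative only, not part of the patch):

#include <stdlib.h>
#include <sys/mman.h>

/* Back a large, hot allocation with 2 MiB pages where possible, so one
 * DTLB entry covers 512x more data than a 4 KiB page would. */
static void *alloc_hot_region(size_t bytes)
{
	void *p = NULL;

	if (posix_memalign(&p, 2 * 1024 * 1024, bytes))
		return NULL;
	/* Advisory only: the kernel may ignore the hint. */
	madvise(p, bytes, MADV_HUGEPAGE);
	return p;
}
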
     {
         "BriefDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing",
         "MetricConstraint": "NO_GROUP_EVENTS",
         "MetricExpr": "(110 * tma_info_system_core_frequency * (OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.REMOTE_HITM + OFFCORE_RESPONSE.PF_L2_RFO.L3_MISS.REMOTE_HITM) + 47.5 * tma_info_system_core_frequency * (OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE + OFFCORE_RESPONSE.PF_L2_RFO.L3_HIT.HITM_OTHER_CORE)) / tma_info_thread_clks",
-        "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
+        "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issueSyncxn;tma_store_bound_group",
         "MetricName": "tma_false_sharing",
-        "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
+        "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric roughly estimates how often CPU was handling synchronizations due to False Sharing. False Sharing is a multithreading hiccup; where multiple Logical Processors contend on different data-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM, OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.HITM_OTHER_CORE. Related metrics: tma_bottleneck_memory_synchronization, tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma_remote_cache",
         "ScaleUnit": "100%"
     },
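
The false-sharing pattern tma_false_sharing flags, in a minimal C sketch assuming a 64-byte cache line (GCC/Clang attribute syntax; illustrative only):

/* Two threads incrementing logically independent counters that happen to
 * share one cache line will bounce that line between cores (XSNP_HITM).
 * Padding each counter to its own line removes the contention. */
struct shared_bad {
	long a;		/* written by thread 0 */
	long b;		/* written by thread 1: same cache line as 'a' */
};

struct shared_good {
	long a __attribute__((aligned(64)));	/* own cache line */
	long b __attribute__((aligned(64)));	/* own cache line */
};
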
     {
         "BriefDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
-        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=1@ / tma_info_thread_clks",
-        "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
+        "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PEND_MISS.FB_FULL\\,cmask\\=0x1@ / tma_info_thread_clks",
+        "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;tma_issueSL;tma_issueSmSt;tma_l1_bound_group",
         "MetricName": "tma_fb_full",
         "MetricThreshold": "tma_fb_full > 0.3",
-        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency, tma_streaming_stores",
+        "PublicDescription": "This metric does a *rough estimation* of how often L1D Fill Buffer unavailability limited additional L1D miss memory access requests to proceed. The higher the metric value; the deeper the memory hierarchy level the misses are satisfied from (metric values >1 are valid). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or external memory). Related metrics: tma_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_latency",
         "ScaleUnit": "100%"
     },
     {
@@ -497,7 +622,7 @@
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.2",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues. For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATENCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
@@ -507,17 +632,17 @@
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
+        "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues. For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16, FRONTEND_RETIRED.LATENCY_GE_8",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or up to ([SNB+] four; [ADL+] five) uops",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops",
         "MetricConstraint": "NO_GROUP_EVENTS_NMI",
         "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer",
         "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;tma_issueD0",
         "MetricName": "tma_few_uops_instructions",
         "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_operations > 0.1",
-        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or up to ([SNB+] four; [ADL+] five) uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
+        "PublicDescription": "This metric represents fraction of slots where the CPU was retiring instructions that are decoded into two or more uops. This highly-correlates with the number of uops in such instructions. Related metrics: tma_decoder0_alone",
         "ScaleUnit": "100%"
     },
     {
@@ -527,7 +652,7 @@
         "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_fp_arith",
         "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.6",
-        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting.",
+        "PublicDescription": "This metric represents overall arithmetic floating-point (FP) operations fraction the CPU has executed (retired). Note this metric's value may exceed its parent due to use of \"Uops\" CountDomain and FMA double-counting",
         "ScaleUnit": "100%"
     },
     {
@@ -536,7 +661,7 @@
         "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group",
         "MetricName": "tma_fp_assists",
         "MetricThreshold": "tma_fp_assists > 0.1",
-        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals).",
+        "PublicDescription": "This metric roughly estimates fraction of slots the CPU retired uops as a result of handling Floating Point (FP) Assists. FP Assist may apply when working with very small floating point values (so-called Denormals)",
         "ScaleUnit": "100%"
     },
     {
@@ -544,17 +669,17 @@
         "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_scalar",
-        "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) scalar uops fraction the CPU has retired. May overcount due to FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths",
         "MetricConstraint": "NO_GROUP_EVENTS",
-        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xfc@ / UOPS_RETIRED.RETIRE_SLOTS",
+        "MetricExpr": "cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=0xFC@ / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_group;tma_issue2P",
         "MetricName": "tma_fp_vector",
-        "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6)",
+        "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic floating-point (FP) vector uops fraction the CPU has retired aggregated across all vector widths. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
@@ -563,8 +688,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_128b",
-        "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 128-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -572,8 +697,8 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_256b",
-        "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
-        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
+        "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
+        "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 256-bit wide vectors. May overcount due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
     {
@@ -581,7 +706,7 @@
         "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS",
         "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector_group;tma_issue2P",
         "MetricName": "tma_fp_vector_512b",
-        "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))",
+        "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma_light_operations > 0.6",
         "PublicDescription": "This metric approximates arithmetic FP vector uops fraction the CPU has retired for 512-bit wide vectors. May overcount due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2",
         "ScaleUnit": "100%"
     },
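
On the "FMA double counting" caveat repeated in the FP metrics above: an FMA retires as a single uop, while the FP_ARITH_INST_RETIRED events count it as two operations (the multiply and the add), so uop-fraction approximations built on those events can overcount. A hedged C illustration (requires -mfma; illustrative only):

#include <immintrin.h>

/* One 256-bit FMA uop computes a * b + c across 8 float lanes; event
 * based FP vector fractions may count it twice. */
static __m256 fused_madd(__m256 a, __m256 b, __m256 c)
{
	return _mm256_fmadd_ps(a, b, c);
}
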
Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions -- where one uop can represent mu= ltiple contiguous instructions", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring fused instructions , where one uop can represent mul= tiple contiguous instructions", "MetricExpr": "tma_light_operations * UOPS_RETIRED.MACRO_FUSED / U= OPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_fused_instructions", "MetricThreshold": "tma_fused_instructions > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions -- where one uop can represent m= ultiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of = legacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Ot= her_Light_Ops in MTL!)}", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring fused instructions , where one uop can represent mu= ltiple contiguous instructions. CMP+JCC or DEC+JCC are common examples of l= egacy fusions. {([MTL] Note new MOV+OP and Load+OP fusions appear under Oth= er_Light_Ops in MTL!)}", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "(UOPS_RETIRED.RETIRE_SLOTS + UOPS_RETIRED.MACRO_FUS= ED - INST_RETIRED.ANY) / tma_info_thread_slots", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. 
([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", - "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D1\\,edge@) / tma_info_thread_clks", + "MetricExpr": "(ICACHE_16B.IFDATA_STALL + 2 * cpu@ICACHE_16B.IFDAT= A_STALL\\,cmask\\=3D0x1\\,edge\\=3D0x1@) / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 4 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). 
Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -645,7 +770,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -660,8 +785,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_mite= )))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= )))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) hits = - subset of the Instruction_Fetch_BW Bottleneck. 
Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -669,7 +794,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -677,108 +802,10 @@ }, { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency)) + tma_memory_bound * (tma_l1_bound / (tma_dram_bound + tm= a_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_l1_hit_= latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_full + tma_l1_hit_laten= cy + tma_lock_latency + tma_split_loads + tma_store_fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) * tma_remote_cach= e / (tma_local_mem + tma_remote_cache + tma_remote_mem) + tma_l3_bound / (t= ma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_boun= d) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / = (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bo= und) * tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_= stores + tma_store_latency - tma_store_latency)) + tma_machine_clears * (1 = - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -807,7 +834,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -832,14 +859,14 @@ }, { "BriefDescription": "Actual per-core usage of the Floating Point n= on-X87 execution units (regardless of precision or vector-width)", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@) / (2 * tma_info_core_core_clks= )", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + cpu@FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@) / (2 * tma_info_core_core_clks= )", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -852,20 +879,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHE= S.COUNT", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@ + 2", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@ + 2", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -895,7 +922,13 @@ "MetricName": "tma_info_frontend_l2mpki_code_all" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -910,11 +943,11 @@ { "BriefDescription": "Instructions per FP Arithmetic instruction (l= ower number means higher occurrence rate)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xfc@)", + "MetricExpr": "INST_RETIRED.ANY / (FP_ARITH_INST_RETIRED.SCALAR + = cpu@FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE\\,umask\\=3D0xFC@)", "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. 
Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -922,7 +955,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -930,7 +963,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -938,7 +971,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -946,7 +979,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -954,7 +987,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -994,7 +1027,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -1004,7 +1037,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -1051,7 +1084,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -1069,7 +1102,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -1111,13 +1144,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -1136,7 +1169,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -1192,7 +1225,7 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1213,18 +1246,18 @@ "MetricExpr": "INST_RETIRED.ANY / (FP_ASSIST.ANY + OTHER_ASSISTS.A= NY)", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - 
"MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1242,29 +1275,29 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Reads [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / duration_time", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART0 + UNC_IIO_= DATA_REQ_OF_CPU.MEM_WRITE.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART2 += UNC_IIO_DATA_REQ_OF_CPU.MEM_WRITE.PART3) * 4 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_read_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Reads [GB / sec]. 
Bandwidth of IO reads that are initiated by end device= controllers that are requesting memory from the CPU" }, { "BriefDescription": "Average IO (network or disk) Bandwidth Use fo= r Writes [GB / sec]", - "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / duration_time", + "MetricExpr": "(UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART0 + UNC_IIO_D= ATA_REQ_OF_CPU.MEM_READ.PART1 + UNC_IIO_DATA_REQ_OF_CPU.MEM_READ.PART2 + UN= C_IIO_DATA_REQ_OF_CPU.MEM_READ.PART3) * 4 / 1e9 / tma_info_system_time", "MetricGroup": "IoBW;MemOffcore;Server;SoC", "MetricName": "tma_info_system_io_write_bw", "PublicDescription": "Average IO (network or disk) Bandwidth Use f= or Writes [GB / sec]. Bandwidth of IO writes that are initiated by end devi= ce controllers that are writing memory to the CPU" @@ -1274,13 +1307,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1298,24 +1332,37 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_TOR_OCC= UPANCY.IA_MISS_DRD@thresh\\=3D1@", + "MetricExpr": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / cha@UNC_CHA_TOR= _OCCUPANCY.IA_MISS_DRD\\,thresh\\=3D0x1@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD / UNC_CHA_= TOR_INSERTS.IA_MISS_DRD) / (tma_info_system_socket_clks / tma_info_system_t= ime)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. 
([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "(CORE_POWER.LVL0_TURBO_LICENSE / 2 / tma_info_core_= core_clks if #SMT_on else CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_cor= e_clks)", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1323,7 +1370,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1331,7 +1378,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). 
This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1345,6 +1392,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1353,12 +1407,12 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1367,14 +1421,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1400,43 +1455,52 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. 
Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. 
Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates how often the CPU was s= talled due to L2 cache accesses by loads", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_M= ISS) / tma_info_thread_clks)", + "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + cpu@L1D_PEND_MISS.FB_FULL\\,cm= ask\\=3D0x1@) * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2= _MISS) / tma_info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "3.5 * tma_info_system_core_frequency * MEM_LOAD_RET= IRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) = / tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. 
Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1444,17 +1508,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "17 * tma_info_system_core_frequency * (MEM_LOAD_RET= IRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2))= / tma_info_thread_clks", + "MetricExpr": "(20.5 * tma_info_system_core_frequency - 3.5 * tma_= info_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETI= RED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1462,18 +1526,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). 
Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1491,7 +1555,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1499,24 +1563,48 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from local memory", - "MetricExpr": "59.5 * tma_info_system_core_frequency * MEM_LOAD_L3= _MISS_RETIRED.LOCAL_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(80 * 
tma_info_system_core_frequency - 20.5 * tma_i= nfo_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.LOCAL_DRAM * (1 + MEM= _LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks= ", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_L3_MISS= _RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricExpr": "(12 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (11= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1528,16 +1616,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. 
Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1545,8 +1633,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). 
Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1557,11 +1645,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", "MetricName": "tma_memory_operations", @@ -1575,7 +1663,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). 
These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1583,8 +1671,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1597,12 +1685,12 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", "ScaleUnit": "100%" }, { @@ -1610,8 +1698,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. 
Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1620,7 +1708,7 @@ "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_= light_operations_group", "MetricName": "tma_non_fused_branches", "MetricThreshold": "tma_non_fused_branches > 0.1 & tma_light_opera= tions > 0.6", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused.", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring branch instructions that were not fused. Non-condit= ional branches like direct JMP or CALL would count here. Can be used to exa= mine fusible conditional jumps that were not fused", "ScaleUnit": "100%" }, { @@ -1628,7 +1716,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / UOPS_RETI= RED.RETIRE_SLOTS", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1642,19 +1730,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1708,7 +1796,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_= 512b, tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1717,7 +1805,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. 
Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1734,8 +1822,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1743,8 +1831,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). 
Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1752,7 +1840,7 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_1 - UOPS_EXECUTED.CO= RE_CYCLES_GE_2) / 2 if #SMT_on else EXE_ACTIVITY.1_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1761,35 +1849,35 @@ "MetricExpr": "((UOPS_EXECUTED.CORE_CYCLES_GE_2 - UOPS_EXECUTED.CO= RE_CYCLES_GE_3) / 2 if #SMT_on else EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_c= ore_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", "MetricExpr": "(UOPS_EXECUTED.CORE_CYCLES_GE_3 / 2 if #SMT_on else= UOPS_EXECUTED.CORE_CYCLES_GE_3) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote cache in other socket= s including synchronizations issues", "MetricConstraint": "NO_GROUP_EVENTS_NMI", - "MetricExpr": "(89.5 * tma_info_system_core_frequency * MEM_LOAD_L= 3_MISS_RETIRED.REMOTE_HITM + 89.5 * tma_info_system_core_frequency * MEM_LO= AD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RE= TIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "((110 * tma_info_system_core_frequency - 20.5 * tma= _info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM + (110 = * tma_info_system_core_frequency - 20.5 * tma_info_system_core_frequency) *= MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_= LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_L3_MISS_R= ETIRED.REMOTE_HITM_PS;MEM_LOAD_L3_MISS_RETIRED.REMOTE_FWD_PS. Related metri= cs: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machin= e_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_HITM, MEM= _LOAD_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_bottleneck_memory_sy= nchronization, tma_contested_accesses, tma_data_sharing, tma_false_sharing,= tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling loads from remote memory", - "MetricExpr": "127 * tma_info_system_core_frequency * MEM_LOAD_L3_= MISS_RETIRED.REMOTE_DRAM * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.= L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(147.5 * tma_info_system_core_frequency - 20.5 * tm= a_info_system_core_frequency) * MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM * (1 += MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_= clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_L3_MISS= _RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1807,7 +1895,7 @@ "MetricExpr": "PARTIAL_RAT_STALLS.SCOREBOARD / tma_info_thread_clk= s", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: PARTIAL_RAT_STALLS.SCOREBOARD. Related me= trics: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1817,8 +1905,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. 
Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1826,17 +1914,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES. Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1844,8 +1932,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. 
Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1853,18 +1941,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(L2_RQSTS.RFO_HIT * 11 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. 
contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1880,7 +1968,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1888,7 +1976,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1896,7 +2008,7 @@ "MetricExpr": "9 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", 
"MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1905,8 +2017,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/skylakex/uncore-cache.json b/to= ols/perf/pmu-events/arch/x86/skylakex/uncore-cache.json index 4fc818626491..da46a3aeb58c 100644 --- a/tools/perf/pmu-events/arch/x86/skylakex/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/skylakex/uncore-cache.json @@ -4454,7 +4454,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_CRD", - "Filter": "config1=3D0x4023300000000", + "Filter": "config1=3D0x40233", "PerPkg": "1", "PublicDescription": "TOR Inserts : CRds issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. Does not include a= ddressless requests such as locks and interrupts.", "UMask": "0x11", @@ -4465,7 +4465,7 @@ "Counter": "0,1,2,3", "EventCode": "0x35", "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_DRD", - "Filter": "config1=3D0x4043300000000", + "Filter": "config1=3D0x40433", "PerPkg": "1", "PublicDescription": "TOR Inserts : DRds issued by iA Cores that H= it the LLC : Counts the number of entries successfully inserted into the TO= R that match qualifications specified by the subevent. 
Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4476,7 +4476,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -4486,7 +4486,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -4496,7 +4496,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores that hit the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4507,7 +4507,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_HIT_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that Hit the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4528,7 +4528,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_CRD",
-        "Filter": "config1=0x4023300000000",
+        "Filter": "config1=0x40233",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : CRds issued by iA Cores that Missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4539,7 +4539,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_DRD",
-        "Filter": "config1=0x4043300000000",
+        "Filter": "config1=0x40433",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : DRds issued by iA Cores that Missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4550,7 +4550,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -4560,7 +4560,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -4570,7 +4570,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : LLCPrefRFO issued by iA Cores that missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4581,7 +4581,7 @@
         "Counter": "0,1,2,3",
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IA_MISS_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Inserts : RFOs issued by iA Cores that Missed the LLC : Counts the number of entries successfully inserted into the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4624,7 +4624,7 @@
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_ITOM",
         "Experimental": "1",
-        "Filter": "config1=0x4903300000000",
+        "Filter": "config1=0x49033",
         "PerPkg": "1",
         "PublicDescription": "Counts the number of entries successfully inserted into the TOR that are generated from local IO ItoM requests that miss the LLC. An ItoM request is used by IIO to request a data write without first reading the data for ownership.",
         "UMask": "0x24",
@@ -4636,7 +4636,7 @@
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RDCUR",
         "Experimental": "1",
-        "Filter": "config1=0x43c3300000000",
+        "Filter": "config1=0x43C33",
         "PerPkg": "1",
         "PublicDescription": "Counts the number of entries successfully inserted into the TOR that are generated from local IO RdCur requests and miss the LLC. A RdCur request is used by IIO to read data without changing state.",
         "UMask": "0x24",
@@ -4648,7 +4648,7 @@
         "EventCode": "0x35",
         "EventName": "UNC_CHA_TOR_INSERTS.IO_MISS_RFO",
         "Experimental": "1",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "Counts the number of entries successfully inserted into the TOR that are generated from local IO RFO requests that miss the LLC. A read for ownership (RFO) requests a cache line to be cached in E state with the intent to modify.",
         "UMask": "0x24",
@@ -4865,7 +4865,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_CRD",
-        "Filter": "config1=0x4023300000000",
+        "Filter": "config1=0x40233",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4876,7 +4876,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_DRD",
-        "Filter": "config1=0x4043300000000",
+        "Filter": "config1=0x40433",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4887,7 +4887,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -4897,7 +4897,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x11",
         "Unit": "CHA"
@@ -4907,7 +4907,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4918,7 +4918,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_HIT_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Hit the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x11",
@@ -4939,7 +4939,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_CRD",
-        "Filter": "config1=0x4023300000000",
+        "Filter": "config1=0x40233",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : CRds issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4950,7 +4950,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_DRD",
-        "Filter": "config1=0x4043300000000",
+        "Filter": "config1=0x40433",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : DRds issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4961,7 +4961,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefCRD",
-        "Filter": "config1=0x4b23300000000",
+        "Filter": "config1=0x4b233",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -4971,7 +4971,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefDRD",
-        "Filter": "config1=0x4b43300000000",
+        "Filter": "config1=0x4b433",
         "PerPkg": "1",
         "UMask": "0x21",
         "Unit": "CHA"
@@ -4981,7 +4981,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_LlcPrefRFO",
-        "Filter": "config1=0x4b03300000000",
+        "Filter": "config1=0x4b033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : LLCPrefRFO issued by iA Cores that missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -4992,7 +4992,7 @@
         "Counter": "0",
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IA_MISS_RFO",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "TOR Occupancy : RFOs issued by iA Cores that Missed the LLC : For each cycle, this event accumulates the number of valid entries in the TOR that match qualifications specified by the subevent. Does not include addressless requests such as locks and interrupts.",
         "UMask": "0x21",
@@ -5037,7 +5037,7 @@
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_ITOM",
         "Experimental": "1",
-        "Filter": "config1=0x4903300000000",
+        "Filter": "config1=0x49033",
         "PerPkg": "1",
         "PublicDescription": "For each cycle, this event accumulates the number of valid entries in the TOR that are generated from local IO ItoM requests that miss the LLC. An ItoM is used by IIO to request a data write without first reading the data for ownership.",
         "UMask": "0x24",
@@ -5049,7 +5049,7 @@
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RDCUR",
         "Experimental": "1",
-        "Filter": "config1=0x43c3300000000",
+        "Filter": "config1=0x43C33",
         "PerPkg": "1",
         "PublicDescription": "For each cycle, this event accumulates the number of valid entries in the TOR that are generated from local IO RdCur requests that miss the LLC. A RdCur request is used by IIO to read data without changing state.",
         "UMask": "0x24",
@@ -5061,7 +5061,7 @@
         "EventCode": "0x36",
         "EventName": "UNC_CHA_TOR_OCCUPANCY.IO_MISS_RFO",
         "Experimental": "1",
-        "Filter": "config1=0x4003300000000",
+        "Filter": "config1=0x40033",
         "PerPkg": "1",
         "PublicDescription": "For each cycle, this event accumulates the number of valid entries in the TOR that are generated from local IO RFO requests that miss the LLC. A read for ownership (RFO) requests data to be cached in E state with the intent to modify.",
         "UMask": "0x24",
diff --git a/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.json b/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.json
index 216a00237cd1..eef801c68bd2 100644
--- a/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.json
+++ b/tools/perf/pmu-events/arch/x86/skylakex/uncore-interconnect.json
@@ -13754,16 +13754,5 @@
         "PerPkg": "1",
         "PublicDescription": "Number outstanding register requests within message channel tracker",
         "Unit": "UBOX"
-    },
-    {
-        "BriefDescription": "UPI interconnect send bandwidth for payload. Derived from unc_upi_txl_flits.all_data",
-        "Counter": "0,1,2,3",
-        "EventCode": "0x2",
-        "EventName": "UPI_DATA_BANDWIDTH_TX",
-        "PerPkg": "1",
-        "PublicDescription": "Counts valid data FLITs (80 bit FLow control unITs: 64bits of data) transmitted (TxL) via any of the 3 Intel(R) Ultra Path Interconnect (UPI) slots on this UPI unit.",
-        "ScaleUnit": "7.11E-06Bytes",
-        "UMask": "0xf",
-        "Unit": "UPI"
     }
 ]
-- 
2.48.0.rc2.279.g1de40edade-goog
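The config1 rewrite in the CHA hunks above is mechanical: every new filter value is the old one with 32 bits of zero padding stripped. A throwaway Python check (illustrative only, not part of the patch), with the value pairs copied from the hunks:

# Verify that each updated CHA TOR filter equals the old value with the
# upper 32 bits of padding dropped, i.e. old == new << 32.
pairs = [
    (0x4b23300000000, 0x4b233),  # IA_HIT/IA_MISS LlcPrefCRD
    (0x4b43300000000, 0x4b433),  # IA_HIT/IA_MISS LlcPrefDRD
    (0x4b03300000000, 0x4b033),  # IA_HIT/IA_MISS LlcPrefRFO
    (0x4003300000000, 0x40033),  # IA_HIT/IA_MISS RFO, IO_MISS_RFO
    (0x4023300000000, 0x40233),  # IA_HIT/IA_MISS CRD
    (0x4043300000000, 0x40433),  # IA_HIT/IA_MISS DRD
    (0x4903300000000, 0x49033),  # IO_MISS_ITOM
    (0x43c3300000000, 0x43C33),  # IO_MISS_RDCUR
]
for old, new in pairs:
    assert old == new << 32, hex(old)
print("all filters match: new == old >> 32")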
From nobody Sun Dec 14 13:49:51 2025
Date: Wed, 15 Jan 2025 22:43:55 -0800
In-Reply-To: <20250116064355.344245-1-irogers@google.com>
Message-Id: <20250116064355.344245-24-irogers@google.com>
Mime-Version: 1.0
References: <20250116064355.344245-1-irogers@google.com>
Subject: [PATCH v2 23/23] perf vendor events: Update Tigerlake events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan
Content-Type: text/plain; charset="utf-8"

Update events from v1.16 to v1.17.
Update TMA metrics from 4.8 to 5.01.
Bring in the event updates v1.17:
https://github.com/intel/perfmon/commit/e1d5ac3412450bf049301cb26206d03c41066b83
The TMA 5.01 addition is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Signed-off-by: Ian Rogers
Acked-by: Kan Liang
---
 tools/perf/pmu-events/arch/x86/mapfile.csv         |   2 +-
 .../pmu-events/arch/x86/tigerlake/cache.json       |  45 +-
 .../pmu-events/arch/x86/tigerlake/frontend.json    |  17 -
 .../pmu-events/arch/x86/tigerlake/memory.json      |  13 +-
 .../arch/x86/tigerlake/metricgroups.json           |  10 +-
 .../pmu-events/arch/x86/tigerlake/pipeline.json    |  30 +-
 .../pmu-events/arch/x86/tigerlake/tgl-metrics.json | 745 +++++++++++-------
 .../arch/x86/tigerlake/uncore-interconnect.json    |   4 +-
 .../pmu-events/arch/x86/tigerlake/uncore-other.json|   2 +-
 .../arch/x86/tigerlake/virtual-memory.json         |  18 +
 10 files changed, 513 insertions(+), 373 deletions(-)
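Each file touched below is a flat JSON array of event or metric objects that perf's jevents build step compiles into C lookup tables. A minimal sketch for inspecting one of the updated files directly, assuming a kernel checkout at ./linux (the path and file choice are illustrative):

import json

# Illustrative path; point it at your own checkout.
path = "linux/tools/perf/pmu-events/arch/x86/tigerlake/cache.json"
with open(path) as f:
    entries = json.load(f)

# Print the named events with their event code and umask.
for e in entries:
    if "EventName" in e:
        print(f"{e['EventName']}: EventCode={e.get('EventCode')} UMask={e.get('UMask')}")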
diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-events/arch/x86/mapfile.csv
index 20c0a434de46..6b4ff18f7d9d 100644
--- a/tools/perf/pmu-events/arch/x86/mapfile.csv
+++ b/tools/perf/pmu-events/arch/x86/mapfile.csv
@@ -34,7 +34,7 @@ GenuineIntel-6-(37|4A|4C|4D|5A),v15,silvermont,core
 GenuineIntel-6-(4E|5E|8E|9E|A5|A6),v59,skylake,core
 GenuineIntel-6-55-[01234],v1.36,skylakex,core
 GenuineIntel-6-86,v1.23,snowridgex,core
-GenuineIntel-6-8[CD],v1.16,tigerlake,core
+GenuineIntel-6-8[CD],v1.17,tigerlake,core
 GenuineIntel-6-2C,v5,westmereep-dp,core
 GenuineIntel-6-25,v4,westmereep-sp,core
 GenuineIntel-6-2F,v4,westmereex,core
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/cache.json b/tools/perf/pmu-events/arch/x86/tigerlake/cache.json
index f4144a1110be..d316eefee857 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/cache.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/cache.json
@@ -75,14 +75,23 @@
         "UMask": "0x2"
     },
     {
-        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache when triggered by an L2 cache fill.",
+        "BriefDescription": "Non-modified cache lines that are silently dropped by L2 cache.",
         "Counter": "0,1,2,3",
         "EventCode": "0xf2",
         "EventName": "L2_LINES_OUT.SILENT",
-        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache when triggered by an L2 cache fill. These lines are typically in Shared or Exclusive state. A non-threaded event.",
+        "PublicDescription": "Counts the number of lines that are silently dropped by L2 cache. These lines are typically in Shared or Exclusive state. A non-threaded event.",
         "SampleAfterValue": "200003",
         "UMask": "0x1"
     },
+    {
+        "BriefDescription": "Cache lines that have been L2 hardware prefetched but not used by demand accesses",
+        "Counter": "0,1,2,3",
+        "EventCode": "0xf2",
+        "EventName": "L2_LINES_OUT.USELESS_HWPF",
+        "PublicDescription": "Counts the number of cache lines that have been prefetched by the L2 hardware prefetcher but not used by demand access when evicted from the L2 cache",
+        "SampleAfterValue": "200003",
+        "UMask": "0x4"
+    },
     {
         "BriefDescription": "L2 code requests",
         "Counter": "0,1,2,3",
@@ -233,7 +242,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired load instructions. This event accounts for SW prefetch instructions of PREFETCHNTA or PREFETCHT0/1/2 or PREFETCHW.",
         "SampleAfterValue": "1000003",
         "UMask": "0x81"
@@ -244,7 +252,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ALL_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired store instructions.",
         "SampleAfterValue": "1000003",
         "UMask": "0x82"
@@ -255,7 +262,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired memory instructions - loads and stores.",
         "SampleAfterValue": "1000003",
         "UMask": "0x83"
@@ -266,7 +272,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.LOCK_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with locked access.",
         "SampleAfterValue": "100007",
         "UMask": "0x21"
@@ -277,7 +282,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x41"
@@ -288,7 +292,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.SPLIT_STORES",
-        "PEBS": "1",
         "PublicDescription": "Counts retired store instructions that split across a cacheline boundary.",
         "SampleAfterValue": "100003",
         "UMask": "0x42"
@@ -299,7 +302,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_LOADS",
-        "PEBS": "1",
         "PublicDescription": "Number of retired load instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x11"
@@ -310,7 +312,6 @@
         "Data_LA": "1",
         "EventCode": "0xd0",
         "EventName": "MEM_INST_RETIRED.STLB_MISS_STORES",
-        "PEBS": "1",
         "PublicDescription": "Number of retired store instructions that (start a) miss in the 2nd-level TLB (STLB).",
         "SampleAfterValue": "100003",
         "UMask": "0x12"
@@ -321,7 +322,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions where a cross-core snoop hit in another cores caches on this socket, the data was forwarded back to the requesting core as the data was modified (SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD).",
         "SampleAfterValue": "20011",
         "UMask": "0x4"
@@ -332,7 +332,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts the retired load instructions whose data sources were L3 hit and cross-core snoop missed in on-pkg core cache.",
         "SampleAfterValue": "20011",
         "UMask": "0x1"
@@ -343,7 +342,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NONE",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions whose data sources were hits in L3 without snoops required.",
         "SampleAfterValue": "100003",
         "UMask": "0x8"
@@ -354,7 +352,6 @@
         "Data_LA": "1",
         "EventCode": "0xd2",
         "EventName": "MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions in which the L3 supplied the data and a cross-core snoop hit in another cores caches on this socket but that other core did not forward the data back (SNOOP_HIT_NO_FWD).",
         "SampleAfterValue": "20011",
         "UMask": "0x2"
@@ -365,7 +362,6 @@
         "Data_LA": "1",
         "EventCode": "0xd4",
         "EventName": "MEM_LOAD_MISC_RETIRED.UC",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions with at least one load to uncacheable memory-type, or at least one cache-line split locked access",
         "SampleAfterValue": "100007",
         "UMask": "0x4"
@@ -376,7 +372,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.FB_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop was load missed in L1 but hit FB (Fill Buffers) due to preceding miss to the same cache line with data not ready.",
         "SampleAfterValue": "100007",
         "UMask": "0x40"
@@ -387,7 +382,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L1 data cache. This event includes all SW prefetches and lock instructions regardless of the data source.",
         "SampleAfterValue": "1000003",
         "UMask": "0x1"
@@ -398,7 +392,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L1_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L1 cache.",
         "SampleAfterValue": "200003",
         "UMask": "0x8"
@@ -409,7 +402,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with L2 cache hits as data sources.",
         "SampleAfterValue": "200003",
         "UMask": "0x2"
@@ -420,7 +412,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L2_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions missed L2 cache as data sources.",
         "SampleAfterValue": "100021",
         "UMask": "0x10"
@@ -431,7 +422,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_HIT",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that hit in the L3 cache.",
         "SampleAfterValue": "100021",
         "UMask": "0x4"
@@ -442,7 +432,6 @@
         "Data_LA": "1",
         "EventCode": "0xd1",
         "EventName": "MEM_LOAD_RETIRED.L3_MISS",
-        "PEBS": "1",
         "PublicDescription": "Counts retired load instructions with at least one uop that missed in the L3 cache.",
         "SampleAfterValue": "50021",
         "UMask": "0x20"
@@ -458,7 +447,7 @@
         "UMask": "0x1"
     },
     {
-        "BriefDescription": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
+        "BriefDescription": "Counts demand data reads that hit a cacheline in the L3 where a snoop hit in another cores caches which forwarded the data to the requesting core.",
         "Counter": "0,1,2,3",
         "EventCode": "0xB7, 0xBB",
         "EventName": "OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD",
@@ -532,6 +521,16 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x8"
     },
+    {
+        "BriefDescription": "Cycles with offcore outstanding Code Reads transactions in the SuperQueue (SQ), queue to uncore.",
+        "Counter": "0,1,2,3",
+        "CounterMask": "1",
+        "EventCode": "0x60",
+        "EventName": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_CODE_RD",
+        "PublicDescription": "Counts the number of offcore outstanding Code Reads transactions in the super queue every cycle. The 'Offcore outstanding' state of the transaction lasts from the L2 miss until the sending transaction completion to requestor (SQ deallocation). See the corresponding Umask under OFFCORE_REQUESTS.",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x2"
+    },
     {
         "BriefDescription": "Cycles when offcore outstanding Demand Data Read transactions are present in SuperQueue (SQ), queue to uncore",
         "Counter": "0,1,2,3",
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json b/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json
index 13c052d0f470..6ebdf6c5ce11 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/frontend.json
@@ -44,7 +44,6 @@
         "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x1",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced DSB (Decode stream buffer i.e. the decoded instruction-cache) miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -56,7 +55,6 @@
         "EventName": "FRONTEND_RETIRED.DSB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x11",
-        "PEBS": "1",
         "PublicDescription": "Number of retired Instructions that experienced a critical DSB (Decode stream buffer i.e. the decoded instruction-cache) miss. Critical means stalls were exposed to the back-end as a result of the DSB miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -68,7 +66,6 @@
         "EventName": "FRONTEND_RETIRED.ITLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x14",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced iTLB (Instruction TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -80,7 +77,6 @@
         "EventName": "FRONTEND_RETIRED.L1I_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x12",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L1 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -92,7 +88,6 @@
         "EventName": "FRONTEND_RETIRED.L2_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x13",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions who experienced Instruction L2 Cache true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -104,7 +99,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500106",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 1 cycle which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -116,7 +110,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_128",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x508006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 128 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -128,7 +121,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_16",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x501006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 16 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -140,7 +132,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500206",
-        "PEBS": "1",
         "PublicDescription": "Retired instructions that are fetched after an interval where the front-end delivered no uops for a period of at least 2 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -152,7 +143,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_256",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x510006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 256 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -164,7 +154,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x100206",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after the front-end had at least 1 bubble-slot for a period of 2 cycles. A bubble-slot is an empty issue-pipeline slot while there was no RAT stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -176,7 +165,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_32",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x502006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 32 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -188,7 +176,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_4",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500406",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 4 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -200,7 +187,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_512",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x520006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 512 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -212,7 +198,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_64",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x504006",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are fetched after an interval where the front-end delivered no uops for a period of 64 cycles which was not interrupted by a back-end stall.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -224,7 +209,6 @@
         "EventName": "FRONTEND_RETIRED.LATENCY_GE_8",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x500806",
-        "PEBS": "1",
         "PublicDescription": "Counts retired instructions that are delivered to the back-end after a front-end stall of at least 8 cycles. During this period the front-end delivered no uops.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -236,7 +220,6 @@
         "EventName": "FRONTEND_RETIRED.STLB_MISS",
         "MSRIndex": "0x3F7",
         "MSRValue": "0x15",
-        "PEBS": "1",
         "PublicDescription": "Counts retired Instructions that experienced STLB (2nd level TLB) true miss.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
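An aside on the MSRValue encodings above (an observation from the values themselves, not a statement from the patch): for the FRONTEND_RETIRED.LATENCY_GE_<N> events, the threshold N appears to sit in bits 8-19 of the MSR 0x3F7 value. A throwaway Python check, with the values copied from the hunks:

# Threshold N -> MSRValue, as listed in the frontend.json hunks above.
msr_values = {
    1: 0x500106, 2: 0x500206, 4: 0x500406, 8: 0x500806,
    16: 0x501006, 32: 0x502006, 64: 0x504006, 128: 0x508006,
    256: 0x510006, 512: 0x520006,
}
for n, v in msr_values.items():
    # Extract bits 8-19 and compare against the advertised threshold.
    assert (v >> 8) & 0xFFF == n, (n, hex(v))
print("latency thresholds decode as expected")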
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/memory.json b/tools/perf/pmu-events/arch/x86/tigerlake/memory.json
index a125cefa100f..f99018cab7e7 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/memory.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/memory.json
@@ -25,7 +25,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_128",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x80",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 128 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "1009",
         "UMask": "0x1"
@@ -38,7 +37,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_16",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x10",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 16 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "20011",
         "UMask": "0x1"
@@ -51,7 +49,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_256",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x100",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 256 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "503",
         "UMask": "0x1"
@@ -64,7 +61,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x20",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 32 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100007",
         "UMask": "0x1"
@@ -77,7 +73,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_4",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x4",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 4 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "100003",
         "UMask": "0x1"
@@ -90,7 +85,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_512",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x200",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 512 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "101",
         "UMask": "0x1"
@@ -103,7 +97,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x40",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 64 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "2003",
         "UMask": "0x1"
@@ -116,7 +109,6 @@
         "EventName": "MEM_TRANS_RETIRED.LOAD_LATENCY_GT_8",
         "MSRIndex": "0x3F6",
         "MSRValue": "0x8",
-        "PEBS": "2",
         "PublicDescription": "Counts randomly selected loads when the latency from first dispatch to completion is greater than 8 cycles.  Reported latency may be longer than just the memory latency.",
         "SampleAfterValue": "50021",
         "UMask": "0x1"
@@ -135,17 +127,16 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of times RTM abort was triggered.",
         "SampleAfterValue": "100003",
         "UMask": "0x4"
     },
     {
-        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)",
+        "BriefDescription": "Number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt)",
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc9",
         "EventName": "RTM_RETIRED.ABORTED_EVENTS",
-        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt).",
+        "PublicDescription": "Counts the number of times an RTM execution aborted due to none of the previous 3 categories (e.g. interrupt).",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
     },
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json b/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json
index 3a88260194d1..80ca8021f2de 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/metricgroups.json
@@ -37,6 +37,7 @@
     "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
+    "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "MachineClears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
     "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spreadsheet",
@@ -83,7 +84,9 @@
     "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculation category",
     "tma_branch_mispredicts_group": "Metrics contributing to tma_branch_mispredicts category",
     "tma_branch_resteers_group": "Metrics contributing to tma_branch_resteers category",
+    "tma_code_stlb_miss_group": "Metrics contributing to tma_code_stlb_miss category",
     "tma_core_bound_group": "Metrics contributing to tma_core_bound category",
+    "tma_divider_group": "Metrics contributing to tma_divider category",
     "tma_dram_bound_group": "Metrics contributing to tma_dram_bound category",
     "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category",
     "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store category",
@@ -93,6 +96,7 @@
     "tma_fp_vector_group": "Metrics contributing to tma_fp_vector category",
     "tma_frontend_bound_group": "Metrics contributing to tma_frontend_bound category",
     "tma_heavy_operations_group": "Metrics contributing to tma_heavy_operations category",
+    "tma_icache_misses_group": "Metrics contributing to tma_icache_misses category",
     "tma_issue2P": "Metrics related by the issue $issue2P",
     "tma_issueBM": "Metrics related by the issue $issueBM",
     "tma_issueBW": "Metrics related by the issue $issueBW",
@@ -112,10 +116,13 @@
     "tma_issueSpSt": "Metrics related by the issue $issueSpSt",
     "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn",
     "tma_issueTLB": "Metrics related by the issue $issueTLB",
+    "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses category",
     "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category",
+    "tma_l2_bound_group": "Metrics contributing to tma_l2_bound category",
     "tma_l3_bound_group": "Metrics contributing to tma_l3_bound category",
     "tma_light_operations_group": "Metrics contributing to tma_light_operations category",
     "tma_load_op_utilization_group": "Metrics contributing to tma_load_op_utilization category",
+    "tma_load_stlb_miss_group": "Metrics contributing to tma_load_stlb_miss category",
     "tma_machine_clears_group": "Metrics contributing to tma_machine_clears category",
     "tma_mem_latency_group": "Metrics contributing to tma_mem_latency category",
     "tma_memory_bound_group": "Metrics contributing to tma_memory_bound category",
@@ -128,5 +135,6 @@
     "tma_retiring_group": "Metrics contributing to tma_retiring category",
     "tma_serializing_operation_group": "Metrics contributing to tma_serializing_operation category",
     "tma_store_bound_group": "Metrics contributing to tma_store_bound category",
-    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category"
+    "tma_store_op_utilization_group": "Metrics contributing to tma_store_op_utilization category",
+    "tma_store_stlb_miss_group": "Metrics contributing to tma_store_stlb_miss category"
 }
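The group names above gain meaning through tgl-metrics.json, where each metric lists its groups in a semicolon-separated MetricGroup string. A small sketch (paths assume a kernel checkout at ./linux, purely illustrative) that lists which Tigerlake metrics carry the newly added LockCont group:

import json

base = "linux/tools/perf/pmu-events/arch/x86/tigerlake/"
with open(base + "tgl-metrics.json") as f:
    metrics = json.load(f)

# MetricGroup is a ";"-separated tag list; match the LockCont tag exactly.
for m in metrics:
    if "LockCont" in m.get("MetricGroup", "").split(";"):
        print(m["MetricName"])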
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json
index 09b53b0722a9..7ef1bac08463 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/pipeline.json
@@ -9,6 +9,15 @@
         "SampleAfterValue": "1000003",
         "UMask": "0x9"
     },
+    {
+        "BriefDescription": "ARITH.FP_DIVIDER_ACTIVE",
+        "Counter": "0,1,2,3,4,5,6,7",
+        "CounterMask": "1",
+        "EventCode": "0x14",
+        "EventName": "ARITH.FP_DIVIDER_ACTIVE",
+        "SampleAfterValue": "1000003",
+        "UMask": "0x1"
+    },
     {
         "BriefDescription": "Number of occurrences where a microcode assist is invoked by hardware.",
         "Counter": "0,1,2,3,4,5,6,7",
@@ -23,7 +32,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all branch instructions retired.",
         "SampleAfterValue": "400009"
     },
@@ -32,7 +40,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x11"
@@ -42,7 +49,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts not taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x10"
@@ -52,7 +58,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x1"
@@ -62,7 +67,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.FAR_BRANCH",
-        "PEBS": "1",
         "PublicDescription": "Counts far branch instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x40"
@@ -72,7 +76,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts near indirect branch instructions retired excluding returns. TSX abort is an indirect branch.",
         "SampleAfterValue": "100003",
         "UMask": "0x80"
@@ -82,7 +85,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts both direct and indirect near call instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x2"
@@ -92,7 +94,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_RETURN",
-        "PEBS": "1",
         "PublicDescription": "Counts return instructions retired.",
         "SampleAfterValue": "100007",
         "UMask": "0x8"
@@ -102,7 +103,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc4",
         "EventName": "BR_INST_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken branch instructions retired.",
         "SampleAfterValue": "400009",
         "UMask": "0x20"
@@ -112,7 +112,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.ALL_BRANCHES",
-        "PEBS": "1",
         "PublicDescription": "Counts all the retired branch instructions that were mispredicted by the processor. A branch misprediction occurs when the processor incorrectly predicts the destination of the branch.  When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.",
         "SampleAfterValue": "50021"
     },
@@ -121,7 +120,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND",
-        "PEBS": "1",
         "PublicDescription": "Counts mispredicted conditional branch instructions retired.",
         "SampleAfterValue": "50021",
         "UMask": "0x11"
@@ -131,7 +129,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_NTAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of conditional branch instructions retired that were mispredicted and the branch direction was not taken.",
         "SampleAfterValue": "50021",
         "UMask": "0x10"
@@ -141,7 +138,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.COND_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts taken conditional mispredicted branch instructions retired.",
         "SampleAfterValue": "50021",
         "UMask": "0x1"
@@ -151,7 +147,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT",
-        "PEBS": "1",
         "PublicDescription": "Counts all miss-predicted indirect branch instructions retired (excluding RETs. TSX aborts is considered indirect branch).",
         "SampleAfterValue": "50021",
         "UMask": "0x80"
@@ -161,7 +156,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.INDIRECT_CALL",
-        "PEBS": "1",
         "PublicDescription": "Counts retired mispredicted indirect (near taken) CALL instructions, including both register and memory indirect.",
         "SampleAfterValue": "50021",
         "UMask": "0x2"
@@ -171,7 +165,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.NEAR_TAKEN",
-        "PEBS": "1",
         "PublicDescription": "Counts number of near branch instructions retired that were mispredicted and taken.",
         "SampleAfterValue": "50021",
         "UMask": "0x20"
@@ -181,7 +174,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc5",
         "EventName": "BR_MISP_RETIRED.RET",
-        "PEBS": "1",
         "PublicDescription": "This is a non-precise version (that is, does not use PEBS) of the event that counts mispredicted return instructions retired.",
         "SampleAfterValue": "50021",
         "UMask": "0x8"
@@ -396,7 +388,6 @@
         "BriefDescription": "Number of instructions retired. Fixed Counter - architectural event",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.ANY",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
@@ -406,7 +397,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.ANY_P",
-        "PEBS": "1",
         "PublicDescription": "Counts the number of X86 instructions retired - an Architectural PerfMon event. Counting continues during hardware interrupts, traps, and inside interrupt handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed counter freeing up programmable counters to count other events. INST_RETIRED.ANY_P is counted by a programmable counter.",
         "SampleAfterValue": "2000003"
     },
@@ -415,7 +405,6 @@
         "Counter": "0,1,2,3,4,5,6,7",
         "EventCode": "0xc0",
         "EventName": "INST_RETIRED.NOP",
-        "PEBS": "1",
         "PublicDescription": "Counts all retired NOP or ENDBR32/64 instructions",
         "SampleAfterValue": "2000003",
         "UMask": "0x2"
@@ -424,7 +413,6 @@
         "BriefDescription": "Precise instruction retired event with a reduced effect of PEBS shadow in IP distribution",
         "Counter": "Fixed counter 0",
         "EventName": "INST_RETIRED.PREC_DIST",
-        "PEBS": "1",
         "PublicDescription": "A version of INST_RETIRED that allows for a more unbiased distribution of samples across instructions retired. It utilizes the Precise Distribution of Instructions Retired (PDIR) feature to mitigate some bias in how retired instructions get sampled. Use on Fixed Counter 0.",
         "SampleAfterValue": "2000003",
         "UMask": "0x1"
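The tma_bottleneck_* metrics added in the diff below are plain arithmetic over event ratios. As a worked example with hypothetical counts (illustrative only), tma_bottleneck_branching_overhead, defined below as 100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots), evaluates as:

# Hypothetical event counts, chosen only to illustrate the formula.
branches = 180e9  # BR_INST_RETIRED.ALL_BRANCHES
calls = 20e9      # BR_INST_RETIRED.NEAR_CALL
nops = 5e9        # INST_RETIRED.NOP
slots = 4e12      # tma_info_thread_slots

overhead = 100 * (branches + 2 * calls + nops) / slots
print(f"tma_bottleneck_branching_overhead = {overhead:.2f}%")
# 5.62% here, which would trip the metric's "> 5" threshold.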
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
index c45c6b4a380d..8c0cd6e63a2a 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
@@ -89,12 +89,12 @@
         "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_clks",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group",
         "MetricName": "tma_4k_aliasing",
-        "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))",
-        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound).",
+        "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
+        "PublicDescription": "This metric estimates how often memory load accesses were aliased by preceding stores (in program order) with a 4K address offset. False match is possible; which incur a few cycles load re-issue. However; the short re-issue duration is often hidden by the out-of-order core and HW optimizations; hence a user may safely ignore a high value of this metric unless it manages to propagate up into parent nodes of the hierarchy (e.g. to L1_Bound)",
         "ScaleUnit": "100%"
     },
     {
-        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations.",
+        "BriefDescription": "This metric represents Core fraction of cycles CPU dispatched uops on execution ports for ALU operations",
         "MetricExpr": "(UOPS_DISPATCHED.PORT_0 + UOPS_DISPATCHED.PORT_1 + UOPS_DISPATCHED.PORT_5 + UOPS_DISPATCHED.PORT_6) / (4 * tma_info_core_core_clks)",
         "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group",
         "MetricName": "tma_alu_op_utilization",
@@ -106,7 +106,7 @@
         "MetricExpr": "34 * ASSISTS.ANY / tma_info_thread_slots",
         "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_assists",
-        "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
+        "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
         "PublicDescription": "This metric estimates fraction of slots the CPU retired uops delivered by the Microcode_Sequencer as a result of Assists. Assists are long sequences of uops that are required in certain corner-cases for operations that cannot be handled natively by the execution pipeline. For example; when working with very small floating point values (so-called Denormals); the FP units are not set up to perform these operations natively. Instead; a sequence of instructions to perform the computation on the Denormals is injected into the pipeline. Since these microcode sequences might be dozens of uops long; Assists can be extremely deleterious to performance and they can be avoided in many cases. Sample with: ASSISTS.ANY",
         "ScaleUnit": "100%"
     },
@@ -129,11 +129,104 @@
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL1;Default",
-        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
+        "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example",
         "ScaleUnit": "100%"
     },
     {
+        "BriefDescription": "Total pipeline cost of instruction fetch related bottlenecks by large code footprint programs (i-side cache; TLB and BTB misses)",
+        "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_icache_misses + tma_unknown_branches) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches)",
+        "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB",
+        "MetricName": "tma_bottleneck_big_code",
+        "MetricThreshold": "tma_bottleneck_big_code > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA",
+        "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)",
+        "MetricGroup": "BvBO;Ret",
+        "MetricName": "tma_bottleneck_branching_overhead",
+        "MetricThreshold": "tma_bottleneck_branching_overhead > 5",
+        "PublicDescription": "Total pipeline cost of instructions used for program control-flow - a subset of the Retiring category in TMA. Examples include function calls; loops and alignments. (A lower bound)"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_fb_full / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)))",
+        "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW",
+        "MetricName": "tma_bottleneck_cache_memory_bandwidth",
+        "MetricThreshold": "tma_bottleneck_cache_memory_bandwidth > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * tma_l2_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_l1_latency_dependency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_lock_latency / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_l1_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_loads / (tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_split_stores / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_store_latency / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat",
+        "MetricName": "tma_bottleneck_cache_memory_latency",
+        "MetricThreshold": "tma_bottleneck_cache_memory_latency > 20",
+        "PublicDescription": "Total pipeline cost of external Memory- or Cache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_mem_latency"
+    },
+    {
+        "BriefDescription": "Total pipeline cost when the execution is compute-bound - an estimation",
+        "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_core_bound * (tma_ports_utilization / (tma_divider + tma_serializing_operation + tma_ports_utilization)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))",
+        "MetricGroup": "BvCB;Cor;tma_issueComp",
+        "MetricName": "tma_bottleneck_compute_bound_est",
+        "MetricThreshold": "tma_bottleneck_compute_bound_est > 20",
+        "PublicDescription": "Total pipeline cost when the execution is compute-bound - an estimation. Covers Core Bound when High ILP as well as when long-latency execution units are busy"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of instruction fetch bandwidth related bottlenecks (when the front-end could not sustain operations delivery to the back-end)",
+        "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms))) - tma_bottleneck_big_code",
+        "MetricGroup": "BvFB;Fed;FetchBW;Frontend",
+        "MetricName": "tma_bottleneck_instruction_fetch_bw",
+        "MetricThreshold": "tma_bottleneck_instruction_fetch_bw > 20"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of irregular execution (e.g",
+        "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * (tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clears_resteers + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_mispredicts_resteers) / (tma_mispredicts_resteers + tma_clears_resteers + tma_unknown_branches)) / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_ms / (tma_mite + tma_dsb + tma_lsd + tma_ms)) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_0) / (tma_divider + tma_serializing_operation + tma_ports_utilization) + tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS",
+        "MetricName": "tma_bottleneck_irregular_overhead",
+        "MetricThreshold": "tma_bottleneck_irregular_overhead > 10",
+        "PublicDescription": "Total pipeline cost of irregular execution (e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloads, overhead in system services or virtualized environments). Related metrics: tma_microcode_sequencer, tma_ms_switches"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_memory_bound, tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_dtlb_load + tma_store_fwd_blk + tma_l1_latency_dependency + tma_lock_latency + tma_split_loads + tma_4k_aliasing + tma_fb_full)) + tma_memory_bound * (tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound)) * (tma_dtlb_store / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store)))",
+        "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB",
+        "MetricName": "tma_bottleneck_memory_data_tlbs",
+        "MetricThreshold": "tma_bottleneck_memory_data_tlbs > 20",
+        "PublicDescription": "Total pipeline cost of Memory Address Translation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load, tma_dtlb_store"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors)",
+        "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * (tma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_data_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_dram_bound + tma_store_bound) * tma_false_sharing / (tma_store_latency + tma_false_sharing + tma_split_stores + tma_streaming_stores + tma_dtlb_store - tma_store_latency)) + tma_machine_clears * (1 - tma_other_nukes / tma_other_nukes))",
        "MetricGroup": "BvMS;LockCont;Mem;Offcore;tma_issueSyncxn",
+        "MetricName": "tma_bottleneck_memory_synchronization",
+        "MetricThreshold": "tma_bottleneck_memory_synchronization > 10",
+        "PublicDescription": "Total pipeline cost of Memory Synchronization related bottlenecks (data transfers and coherency updates across processors). Related metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of Branch Misprediction related bottlenecks",
+        "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetch_latency * tma_mispredicts_resteers / (tma_icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + tma_lcp + tma_dsb_switches))",
+        "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM",
+        "MetricName": "tma_bottleneck_mispredictions",
+        "MetricThreshold": "tma_bottleneck_mispredictions > 20",
+        "PublicDescription": "Total pipeline cost of Branch Misprediction related bottlenecks. Related metrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of remaining bottlenecks in the back-end",
+        "MetricExpr": "100 - (tma_bottleneck_big_code + tma_bottleneck_instruction_fetch_bw + tma_bottleneck_mispredictions + tma_bottleneck_cache_memory_bandwidth + tma_bottleneck_cache_memory_latency + tma_bottleneck_memory_data_tlbs + tma_bottleneck_memory_synchronization + tma_bottleneck_compute_bound_est + tma_bottleneck_irregular_overhead + tma_bottleneck_branching_overhead + tma_bottleneck_useful_work)",
+        "MetricGroup": "BvOB;Cor;Offcore",
+        "MetricName": "tma_bottleneck_other_bottlenecks",
+        "MetricThreshold": "tma_bottleneck_other_bottlenecks > 20",
+        "PublicDescription": "Total pipeline cost of remaining bottlenecks in the back-end. Examples include data-dependencies (Core Bound when Low ILP) and other unlisted memory-related stalls"
+    },
+    {
+        "BriefDescription": "Total pipeline cost of \"useful operations\" - the portion of Retiring category not covered by Branching_Overhead nor Irregular_Overhead",
+        "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)",
+        "MetricGroup": "BvUW;Ret",
+        "MetricName": "tma_bottleneck_useful_work",
+        "MetricThreshold": "tma_bottleneck_useful_work > 20"
+    },
+    {
-        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions.",
+        "BriefDescription": "This metric represents fraction of slots where the CPU was retiring branch instructions",
         "MetricExpr": "tma_light_operations * BR_INST_RETIRED.ALL_BRANCHES / (tma_retiring * tma_info_thread_slots)",
         "MetricGroup": "Branches;BvBO;Pipeline;TopdownL3;tma_L3_group;tma_light_operations_group",
         "MetricName": "tma_branch_instructions",
@@ -147,7 +240,7 @@
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
         "MetricgroupNoGroup": "TopdownL2",
-        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
+        "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction. These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_bottleneck_mispredictions, tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
@@ -155,8 +248,8 @@
         "MetricExpr": "INT_MISC.CLEAR_RESTEER_CYCLES / tma_info_thread_clks + tma_unknown_branches",
         "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_group",
         "MetricName": "tma_branch_resteers",
-        "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15)",
-        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES",
+        "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
+        "PublicDescription": "This metric represents fraction of cycles the CPU was stalled due to Branch Resteers. Branch Resteers estimates the Frontend delay in fetching operations from corrected path; following all sorts of miss-predicted branches. For example; branchy code with lots of miss-predictions might get categorized under Branch Resteers. Note the value of this node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics: tma_l3_hit_latency, tma_store_latency",
         "ScaleUnit": "100%"
     },
@@ -164,8 +257,8 @@
         "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)",
         "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_group",
         "MetricName": "tma_cisc",
-        "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1)",
-        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example. Since these instructions require multiple uops they may or may not imply sub-optimal use of machine resources.",
+        "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.05 & tma_heavy_operations > 0.1",
+        "PublicDescription": "This metric estimates fraction of cycles the CPU retired uops originated from CISC (complex instruction set computer) instruction. A CISC instruction has multiple uops that are required to perform the instruction's functionality as in the case of read-modify-write as an example.
Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -173,18 +266,66 @@ "MetricExpr": "(1 - BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRE= D.ALL_BRANCHES + MACHINE_CLEARS.COUNT)) * INT_MISC.CLEAR_RESTEER_CYCLES / t= ma_info_thread_clks", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. Sam= ple with: INT_MISC.CLEAR_RESTEER_CYCLES. Related metrics: tma_l1_bound, tma= _machine_clears, tma_microcode_sequencer, tma_ms_switches", "ScaleUnit": "100%" }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that hit in the L2 cache", + "MetricExpr": "max(0, tma_icache_misses - tma_code_l2_miss)", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_hit", + "MetricThreshold": "tma_code_l2_hit > 0.05 & tma_icache_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates fraction of cycles the = CPU was stalled due to instruction cache misses that miss in the L2 cache", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.CYCLES_WITH_DEMAND_COD= E_RD / tma_info_thread_clks", + "MetricGroup": "FetchLat;IcMiss;Offcore;TopdownL4;tma_L4_group;tma= _icache_misses_group", + "MetricName": "tma_code_l2_miss", + "MetricThreshold": "tma_code_l2_miss > 0.05 & tma_icache_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric roughly estimates the fraction of= cycles where the (first level) ITLB was missed by instructions fetches, th= at later on hit in second-level TLB (STLB)", + "MetricExpr": "max(0, tma_itlb_misses - tma_code_stlb_miss)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_hit", + "MetricThreshold": "tma_code_stlb_hit > 0.05 & tma_itlb_misses > 0= .05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = where the Second-level TLB (STLB) was missed by instruction fetches, perfor= ming a hardware page walk", + "MetricExpr": "ITLB_MISSES.WALK_ACTIVE / tma_info_thread_clks", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL4;tma_L4_group;tma_itlb= _misses_group", + "MetricName": "tma_code_stlb_miss", + "MetricThreshold": "tma_code_stlb_miss > 0.05 & tma_itlb_misses > = 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_2M_= 4M / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= 
_stlb_miss_group", + "MetricName": "tma_code_stlb_miss_2m", + "MetricThreshold": "tma_code_stlb_miss_2m > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= (instruction) code accesses", + "MetricExpr": "tma_code_stlb_miss * ITLB_MISSES.WALK_COMPLETED_4K = / (ITLB_MISSES.WALK_COMPLETED_4K + ITLB_MISSES.WALK_COMPLETED_2M_4M)", + "MetricGroup": "FetchLat;MemoryTLB;TopdownL5;tma_L5_group;tma_code= _stlb_miss_group", + "MetricName": "tma_code_stlb_miss_4k", + "MetricThreshold": "tma_code_stlb_miss_4k > 0.05 & tma_code_stlb_m= iss > 0.05 & tma_itlb_misses > 0.05 & tma_fetch_latency > 0.1 & tma_fronten= d_bound > 0.15", + "ScaleUnit": "100%" + }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "(49 * tma_info_system_core_frequency * (MEM_LOAD_L3= _HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND= _DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))= ) + 48 * tma_info_system_core_frequency * MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS= ) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info= _thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricExpr": "((54 * tma_info_system_core_frequency - 5 * tma_inf= o_system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (OCR.DEMAND_= DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEM= AND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) + (53 * tma_info_system_core_frequ= ency - 5 * tma_info_system_core_frequency) * MEM_LOAD_L3_HIT_RETIRED.XSNP_M= ISS) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma_i= nfo_thread_clks", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related m= etrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma_remote= _cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. 
Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_FWD, MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS. Related = metrics: tma_bottleneck_memory_synchronization, tma_data_sharing, tma_false= _sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -194,25 +335,25 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to data-sharing ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "48 * tma_info_system_core_frequency * (MEM_LOAD_L3_= HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAN= D_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT /= MEM_LOAD_RETIRED.L1_MISS / 2) / tma_info_thread_clks", + "MetricExpr": "(53 * tma_info_system_core_frequency - 5 * tma_info= _system_core_frequency) * (MEM_LOAD_L3_HIT_RETIRED.XSNP_NO_FWD + MEM_LOAD_L= 3_HIT_RETIRED.XSNP_FWD * (1 - OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HITM / (OCR.D= EMAND_DATA_RD.L3_HIT.SNOOP_HITM + OCR.DEMAND_DATA_RD.L3_HIT.SNOOP_HIT_WITH_= FWD))) * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / tma= _info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. 
Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_NO_FWD. Related metrics: tma_bottleneck_memory_synch= ronization, tma_contested_accesses, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re decoder-0 was the only active decoder", - "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D1@ - cpu@INS= T_DECODED.DECODERS\\,cmask\\=3D2@) / tma_info_core_core_clks / 2", + "MetricExpr": "(cpu@INST_DECODED.DECODERS\\,cmask\\=3D0x1@ - cpu@I= NST_DECODED.DECODERS\\,cmask\\=3D0x2@) / tma_info_core_core_clks / 2", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_issueD0= ;tma_mite_group", "MetricName": "tma_decoder0_alone", - "MetricThreshold": "tma_decoder0_alone > 0.1 & (tma_mite > 0.1 & t= ma_fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_decoder0_alone > 0.1 & tma_mite > 0.1 & tm= a_fetch_bandwidth > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere decoder-0 was the only active decoder. Related metrics: tma_few_uops_in= structions", "ScaleUnit": "100%" }, @@ -221,7 +362,7 @@ "MetricExpr": "ARITH.DIVIDER_ACTIVE / tma_info_thread_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_ACTIVE", "ScaleUnit": "100%" }, @@ -231,8 +372,8 @@ "MetricExpr": "CYCLE_ACTIVITY.STALLS_L3_MISS / tma_info_thread_clk= s + (CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks - tma_l2_bound", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_RETIRED= .L3_MISS", "ScaleUnit": "100%" }, { @@ -241,7 +382,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. 
For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -249,44 +390,44 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS_PS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_ban= dwidth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_= info_inst_mix_iptb, tma_lcp", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. Sample with: FRONTEND_RETIRED.DS= B_MISS. Related metrics: tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwi= dth, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_inf= o_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D1= @ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYCLE= _ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", + "MetricExpr": "min(7 * cpu@DTLB_LOAD_MISSES.STLB_HIT\\,cmask\\=3D0= x1@ + DTLB_LOAD_MISSES.WALK_ACTIVE, max(CYCLE_ACTIVITY.CYCLES_MEM_ANY - CYC= LE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS_PS. 
Related metrics: t= ma_dtlb_store, tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_me= mory_synchronization", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_INST_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= bottleneck_memory_data_tlbs, tma_dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D1@ = + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", + "MetricExpr": "(7 * cpu@DTLB_STORE_MISSES.STLB_HIT\\,cmask\\=3D0x1= @ + DTLB_STORE_MISSES.WALK_ACTIVE) / tma_info_core_core_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load, = tma_info_bottleneck_memory_data_tlbs, tma_info_bottleneck_memory_synchroniz= ation", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_INST_RETIRED.STLB_MISS_STORES. 
Related metrics: tma_bottleneck_mem= ory_data_tlbs, tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "54 * tma_info_system_core_frequency * OCR.DEMAND_RF= O.L3_HIT.SNOOP_HITM / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_contested_accesses, tma_data_sharing= , tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: OCR.DEMAND_RFO.L= 3_HIT.SNOOP_HITM. Related metrics: tma_bottleneck_memory_synchronization, t= ma_contested_accesses, tma_data_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricExpr": "L1D_PEND_MISS.FB_FULL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_bottleneck_cache_memory_bandwi= dth, tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store= _latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). 
Related metrics: tma_bottleneck_cache_memory_bandwidth, = tma_info_system_dram_bw_use, tma_mem_bandwidth, tma_sq_full, tma_store_late= ncy, tma_streaming_stores", "ScaleUnit": "100%" }, { @@ -296,7 +437,7 @@ "MetricName": "tma_fetch_bandwidth", "MetricThreshold": "tma_fetch_bandwidth > 0.2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LA= TENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_sw= itches, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tm= a_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend bandwidth issues. For example; inefficien= cies at the instruction decoders; or restrictions for caching in the DSB (d= ecoded uops cache) are categorized under Fetch Bandwidth. In such cases; th= e Frontend typically delivers suboptimal amount of uops to the Backend. Sam= ple with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1, FRONTEND_RETIRED.LATE= NCY_GE_1, FRONTEND_RETIRED.LATENCY_GE_2. Related metrics: tma_dsb_switches,= tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_= frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { @@ -306,16 +447,16 @@ "MetricName": "tma_fetch_latency", "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound >= 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS", + "PublicDescription": "This metric represents fraction of slots the= CPU was stalled due to Frontend latency issues. For example; instruction-= cache misses; iTLB misses or fetch stalls after a branch misprediction are = categorized under Frontend Latency. In such cases; the Frontend eventually = delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_= 16, FRONTEND_RETIRED.LATENCY_GE_8", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or up to= ([SNB+] four; [ADL+] five) uops", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring instructions that that are decoder into two or more = uops", "MetricExpr": "tma_heavy_operations - tma_microcode_sequencer", "MetricGroup": "TopdownL3;tma_L3_group;tma_heavy_operations_group;= tma_issueD0", "MetricName": "tma_few_uops_instructions", "MetricThreshold": "tma_few_uops_instructions > 0.05 & tma_heavy_o= perations > 0.1", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or up t= o ([SNB+] four; [ADL+] five) uops. 
This highly-correlates with the number o= f uops in such instructions. Related metrics: tma_decoder0_alone", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring instructions that that are decoder into two or more= uops. This highly-correlates with the number of uops in such instructions.= Related metrics: tma_decoder0_alone", "ScaleUnit": "100%" }, { @@ -324,7 +465,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -333,7 +474,15 @@ "MetricGroup": "HPC;TopdownL5;tma_L5_group;tma_assists_group", "MetricName": "tma_fp_assists", "MetricThreshold": "tma_fp_assists > 0.1", - "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals).", + "PublicDescription": "This metric roughly estimates fraction of sl= ots the CPU retired uops as a result of handing Floating Point (FP) Assists= . FP Assist may apply when working with very small floating point values (s= o-called Denormals)", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Floating-Point Divider unit was active", + "MetricExpr": "ARITH.FP_DIVIDER_ACTIVE / tma_info_thread_clks", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_fp_divider", + "MetricThreshold": "tma_fp_divider > 0.2 & tma_divider > 0.2 & tma= _core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -341,7 +490,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. 
Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -350,7 +499,7 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / (tma_retiring * tma_= info_thread_slots)", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -359,8 +508,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -368,8 +517,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. 
Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -377,7 +526,7 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.512B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.512B_PACKED_SINGLE) / (tma_retiring * tma_info_thread_slots)= ", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_512b", - "MetricThreshold": "tma_fp_vector_512b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", + "MetricThreshold": "tma_fp_vector_512b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 512-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, @@ -389,17 +538,17 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1;Default", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4_PS", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound. 
Sample with: FRONTE= ND_RETIRED.LATENCY_GE_4", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", - "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D1@) / IDQ.MITE_UOPS", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", + "MetricExpr": "tma_microcode_sequencer + tma_retiring * (UOPS_DECO= DED.DEC0 - cpu@UOPS_DECODED.DEC0\\,cmask\\=3D0x1@) / IDQ.MITE_UOPS", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. ([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { @@ -407,41 +556,41 @@ "MetricExpr": "ICACHE_DATA.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS_PS;FRONTEND_RETIRED.L1I_MISS_PS", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to instruction cache misses. Sample with: FRONTEND_RE= TIRED.L2_MISS, FRONTEND_RETIRED.L1I_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "Branch Misprediction Cost: Fraction of TMA sl= ots wasted per non-speculative branch misprediction (retired JEClear)", + "BriefDescription": "Branch Misprediction Cost: Cycles representin= g fraction of TMA slots wasted per non-speculative branch misprediction (re= tired JEClear)", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_bottleneck_mispredictions * tma_info_threa= d_slots / BR_MISP_RETIRED.ALL_BRANCHES / 100", + "MetricExpr": "tma_bottleneck_mispredictions * tma_info_thread_slo= ts / 5 / BR_MISP_RETIRED.ALL_BRANCHES / 100", "MetricGroup": "Bad;BrMispredicts;tma_issueBM", "MetricName": "tma_info_bad_spec_branch_misprediction_cost", - "PublicDescription": "Branch Misprediction Cost: Fraction of TMA s= lots wasted per non-speculative branch misprediction (retired JEClear). 
Rel= ated metrics: tma_branch_mispredicts, tma_info_bottleneck_mispredictions, t= ma_mispredicts_resteers" + "PublicDescription": "Branch Misprediction Cost: Cycles representi= ng fraction of TMA slots wasted per non-speculative branch misprediction (r= etired JEClear). Related metrics: tma_bottleneck_mispredictions, tma_branch= _mispredicts, tma_mispredicts_resteers" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional non-taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_NTAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_ntaken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_ntaken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for cond= itional taken branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for cond= itional taken branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.COND_TAKEN", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_cond_taken", "MetricThreshold": "tma_info_bad_spec_ipmisp_cond_taken < 200" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.INDIRECT", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { - "BriefDescription": "Instructions per retired mispredicts for retu= rn branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for retu= rn branches (lower number means higher occurrence rate)", "MetricExpr": "INST_RETIRED.ANY / BR_MISP_RETIRED.RET", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_ret", @@ -455,7 +604,7 @@ "MetricThreshold": "tma_info_bad_spec_ipmispredict < 200" }, { - "BriefDescription": "Speculative to Retired ratio of all clears (c= overing mispredicts and nukes)", + "BriefDescription": "Speculative to Retired ratio of all clears (c= overing Mispredicts and nukes)", "MetricExpr": "INT_MISC.CLEARS_COUNT / (BR_MISP_RETIRED.ALL_BRANCH= ES + MACHINE_CLEARS.COUNT)", "MetricGroup": "BrMispredicts", "MetricName": "tma_info_bad_spec_spec_clears_ratio" @@ -470,8 +619,8 @@ }, { "BriefDescription": "Total pipeline cost of DSB (uop cache) hits -= subset of the Instruction_Fetch_BW Bottleneck", - "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_bandwidth + tma_fetch_latency)) * (tma_dsb / (tma_dsb + tma_lsd = + tma_mite)))", - "MetricGroup": "DSB;FetchBW;tma_issueFB", + "MetricExpr": "100 * (tma_frontend_bound * (tma_fetch_bandwidth / = (tma_fetch_latency + tma_fetch_bandwidth)) * (tma_dsb / (tma_mite + tma_dsb= + tma_lsd + tma_ms)))", + "MetricGroup": "DSB;Fed;FetchBW;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_bandwidth", "MetricThreshold": "tma_info_botlnk_l2_dsb_bandwidth > 10", "PublicDescription": "Total pipeline cost of DSB (uop cache) 
hits = - subset of the Instruction_Fetch_BW Bottleneck. Related metrics: tma_dsb_s= witches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_misses, tma_info_front= end_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp" @@ -479,7 +628,7 @@ { "BriefDescription": "Total pipeline cost of DSB (uop cache) misses= - subset of the Instruction_Fetch_BW Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses + = tma_lcp + tma_ms_switches) + tma_fetch_bandwidth * tma_mite / (tma_dsb + tm= a_lsd + tma_mite))", + "MetricExpr": "100 * (tma_fetch_latency * tma_dsb_switches / (tma_= icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + t= ma_lcp + tma_dsb_switches) + tma_fetch_bandwidth * tma_mite / (tma_mite + t= ma_dsb + tma_lsd + tma_ms))", "MetricGroup": "DSBmiss;Fed;tma_issueFB", "MetricName": "tma_info_botlnk_l2_dsb_misses", "MetricThreshold": "tma_info_botlnk_l2_dsb_misses > 10", @@ -488,108 +637,10 @@ { "BriefDescription": "Total pipeline cost of Instruction Cache miss= es - subset of the Big_Code Bottleneck", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _branch_resteers + tma_dsb_switches + tma_icache_misses + tma_itlb_misses += tma_lcp + tma_ms_switches))", + "MetricExpr": "100 * (tma_fetch_latency * tma_icache_misses / (tma= _icache_misses + tma_itlb_misses + tma_branch_resteers + tma_ms_switches + = tma_lcp + tma_dsb_switches))", "MetricGroup": "Fed;FetchLat;IcMiss;tma_issueFL", "MetricName": "tma_info_botlnk_l2_ic_misses", - "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5", - "PublicDescription": "Total pipeline cost of Instruction Cache mis= ses - subset of the Big_Code Bottleneck. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch rela= ted bottlenecks by large code footprint programs (i-side cache; TLB and BTB= misses)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * tma_fetch_latency * (tma_itlb_misses + tma_ic= ache_misses + tma_unknown_branches) / (tma_branch_resteers + tma_dsb_switch= es + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches)", - "MetricGroup": "BigFootprint;BvBC;Fed;Frontend;IcMiss;MemoryTLB", - "MetricName": "tma_info_bottleneck_big_code", - "MetricThreshold": "tma_info_bottleneck_big_code > 20" - }, - { - "BriefDescription": "Total pipeline cost of instructions used for = program control-flow - a subset of the Retiring category in TMA", - "MetricExpr": "100 * ((BR_INST_RETIRED.ALL_BRANCHES + 2 * BR_INST_= RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slots)", - "MetricGroup": "BvBO;Ret", - "MetricName": "tma_info_bottleneck_branching_overhead", - "MetricThreshold": "tma_info_bottleneck_branching_overhead > 5", - "PublicDescription": "Total pipeline cost of instructions used for= program control-flow - a subset of the Retiring category in TMA. Examples = include function calls; loops and alignments. 
(A lower bound)" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Bandwidth related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_bandwidth / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_b= ound * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_= l3_bound + tma_store_bound)) * (tma_sq_full / (tma_contested_accesses + tma= _data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound * (tm= a_l1_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound += tma_store_bound)) * (tma_fb_full / (tma_4k_aliasing + tma_dtlb_load + tma_= fb_full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_sto= re_fwd_blk)))", - "MetricGroup": "BvMB;Mem;MemoryBW;Offcore;tma_issueBW", - "MetricName": "tma_info_bottleneck_cache_memory_bandwidth", - "MetricThreshold": "tma_info_bottleneck_cache_memory_bandwidth > 2= 0", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Bandwidth related bottlenecks. Related metrics: tma_fb_full, tma_info_= system_dram_bw_use, tma_mem_bandwidth, tma_sq_full" - }, - { - "BriefDescription": "Total pipeline cost of external Memory- or Ca= che-Latency related bottlenecks", - "MetricExpr": "100 * (tma_memory_bound * (tma_dram_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) *= (tma_mem_latency / (tma_mem_bandwidth + tma_mem_latency)) + tma_memory_bou= nd * (tma_l3_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3= _bound + tma_store_bound)) * (tma_l3_hit_latency / (tma_contested_accesses = + tma_data_sharing + tma_l3_hit_latency + tma_sq_full)) + tma_memory_bound = * tma_l2_bound / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bou= nd + tma_store_bound) + tma_memory_bound * (tma_store_bound / (tma_dram_bou= nd + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)) * (tma_= store_latency / (tma_dtlb_store + tma_false_sharing + tma_split_stores + tm= a_store_latency + tma_streaming_stores)) + tma_memory_bound * (tma_l1_bound= / (tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store= _bound)) * (tma_l1_hit_latency / (tma_4k_aliasing + tma_dtlb_load + tma_fb_= full + tma_l1_hit_latency + tma_lock_latency + tma_split_loads + tma_store_= fwd_blk)))", - "MetricGroup": "BvML;Mem;MemoryLat;Offcore;tma_issueLat", - "MetricName": "tma_info_bottleneck_cache_memory_latency", - "MetricThreshold": "tma_info_bottleneck_cache_memory_latency > 20", - "PublicDescription": "Total pipeline cost of external Memory- or C= ache-Latency related bottlenecks. Related metrics: tma_l3_hit_latency, tma_= mem_latency" - }, - { - "BriefDescription": "Total pipeline cost when the execution is com= pute-bound - an estimation", - "MetricExpr": "100 * (tma_core_bound * tma_divider / (tma_divider = + tma_ports_utilization + tma_serializing_operation) + tma_core_bound * (tm= a_ports_utilization / (tma_divider + tma_ports_utilization + tma_serializin= g_operation)) * (tma_ports_utilized_3m / (tma_ports_utilized_0 + tma_ports_= utilized_1 + tma_ports_utilized_2 + tma_ports_utilized_3m)))", - "MetricGroup": "BvCB;Cor;tma_issueComp", - "MetricName": "tma_info_bottleneck_compute_bound_est", - "MetricThreshold": "tma_info_bottleneck_compute_bound_est > 20", - "PublicDescription": "Total pipeline cost when the execution is co= mpute-bound - an estimation. 
Covers Core Bound when High ILP as well as whe= n long-latency execution units are busy. Related metrics: " - }, - { - "BriefDescription": "Total pipeline cost of instruction fetch band= width related bottlenecks (when the front-end could not sustain operations = delivery to the back-end)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_frontend_bound - (1 - 10 * tma_microcode= _sequencer * tma_other_mispredicts / tma_branch_mispredicts) * tma_fetch_la= tency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switches = + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches) - tma_mi= crocode_sequencer / (tma_few_uops_instructions + tma_microcode_sequencer) *= (tma_assists / tma_microcode_sequencer) * tma_fetch_latency * (tma_ms_swit= ches + tma_branch_resteers * (tma_clears_resteers + tma_mispredicts_resteer= s * (10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_misp= redicts)) / (tma_clears_resteers + tma_mispredicts_resteers + tma_unknown_b= ranches)) / (tma_branch_resteers + tma_dsb_switches + tma_icache_misses + t= ma_itlb_misses + tma_lcp + tma_ms_switches)) - tma_info_bottleneck_big_code= ", - "MetricGroup": "BvFB;Fed;FetchBW;Frontend", - "MetricName": "tma_info_bottleneck_instruction_fetch_bw", - "MetricThreshold": "tma_info_bottleneck_instruction_fetch_bw > 20" - }, - { - "BriefDescription": "Total pipeline cost of irregular execution (e= .g", - "MetricExpr": "100 * (tma_microcode_sequencer / (tma_few_uops_inst= ructions + tma_microcode_sequencer) * (tma_assists / tma_microcode_sequence= r) * tma_fetch_latency * (tma_ms_switches + tma_branch_resteers * (tma_clea= rs_resteers + tma_mispredicts_resteers * (10 * tma_microcode_sequencer * tm= a_other_mispredicts / tma_branch_mispredicts)) / (tma_clears_resteers + tma= _mispredicts_resteers + tma_unknown_branches)) / (tma_branch_resteers + tma= _dsb_switches + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_swit= ches) + 10 * tma_microcode_sequencer * tma_other_mispredicts / tma_branch_m= ispredicts * tma_branch_mispredicts + tma_machine_clears * tma_other_nukes = / tma_other_nukes + tma_core_bound * (tma_serializing_operation + tma_core_= bound * RS_EVENTS.EMPTY_CYCLES / tma_info_thread_clks * tma_ports_utilized_= 0) / (tma_divider + tma_ports_utilization + tma_serializing_operation) + tm= a_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_sequence= r) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "Bad;BvIO;Cor;Ret;tma_issueMS", - "MetricName": "tma_info_bottleneck_irregular_overhead", - "MetricThreshold": "tma_info_bottleneck_irregular_overhead > 10", - "PublicDescription": "Total pipeline cost of irregular execution (= e.g. FP-assists in HPC, Wait time with work imbalance multithreaded workloa= ds, overhead in system services or virtualized environments). 
Related metri= cs: tma_microcode_sequencer, tma_ms_switches" - }, - { - "BriefDescription": "Total pipeline cost of Memory Address Transla= tion related bottlenecks (data-side TLBs)", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (tma_memory_bound * (tma_l1_bound / max(tma_m= emory_bound, tma_dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + = tma_store_bound)) * (tma_dtlb_load / max(tma_l1_bound, tma_4k_aliasing + tm= a_dtlb_load + tma_fb_full + tma_l1_hit_latency + tma_lock_latency + tma_spl= it_loads + tma_store_fwd_blk)) + tma_memory_bound * (tma_store_bound / (tma= _dram_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound)= ) * (tma_dtlb_store / (tma_dtlb_store + tma_false_sharing + tma_split_store= s + tma_store_latency + tma_streaming_stores)))", - "MetricGroup": "BvMT;Mem;MemoryTLB;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_data_tlbs", - "MetricThreshold": "tma_info_bottleneck_memory_data_tlbs > 20", - "PublicDescription": "Total pipeline cost of Memory Address Transl= ation related bottlenecks (data-side TLBs). Related metrics: tma_dtlb_load,= tma_dtlb_store, tma_info_bottleneck_memory_synchronization" - }, - { - "BriefDescription": "Total pipeline cost of Memory Synchronization= related bottlenecks (data transfers and coherency updates across processor= s)", - "MetricExpr": "100 * (tma_memory_bound * (tma_l3_bound / (tma_dram= _bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * (t= ma_contested_accesses + tma_data_sharing) / (tma_contested_accesses + tma_d= ata_sharing + tma_l3_hit_latency + tma_sq_full) + tma_store_bound / (tma_dr= am_bound + tma_l1_bound + tma_l2_bound + tma_l3_bound + tma_store_bound) * = tma_false_sharing / (tma_dtlb_store + tma_false_sharing + tma_split_stores = + tma_store_latency + tma_streaming_stores - tma_store_latency)) + tma_mach= ine_clears * (1 - tma_other_nukes / tma_other_nukes))", - "MetricGroup": "BvMS;Mem;Offcore;tma_issueTLB", - "MetricName": "tma_info_bottleneck_memory_synchronization", - "MetricThreshold": "tma_info_bottleneck_memory_synchronization > 1= 0", - "PublicDescription": "Total pipeline cost of Memory Synchronizatio= n related bottlenecks (data transfers and coherency updates across processo= rs). Related metrics: tma_dtlb_load, tma_dtlb_store, tma_info_bottleneck_me= mory_data_tlbs" - }, - { - "BriefDescription": "Total pipeline cost of Branch Misprediction r= elated bottlenecks", - "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "100 * (1 - 10 * tma_microcode_sequencer * tma_other= _mispredicts / tma_branch_mispredicts) * (tma_branch_mispredicts + tma_fetc= h_latency * tma_mispredicts_resteers / (tma_branch_resteers + tma_dsb_switc= hes + tma_icache_misses + tma_itlb_misses + tma_lcp + tma_ms_switches))", - "MetricGroup": "Bad;BadSpec;BrMispredicts;BvMP;tma_issueBM", - "MetricName": "tma_info_bottleneck_mispredictions", - "MetricThreshold": "tma_info_bottleneck_mispredictions > 20", - "PublicDescription": "Total pipeline cost of Branch Misprediction = related bottlenecks. 
Related metrics: tma_branch_mispredicts, tma_info_bad_= spec_branch_misprediction_cost, tma_mispredicts_resteers" - }, - { - "BriefDescription": "Total pipeline cost of remaining bottlenecks = in the back-end", - "MetricExpr": "100 - (tma_info_bottleneck_big_code + tma_info_bott= leneck_instruction_fetch_bw + tma_info_bottleneck_mispredictions + tma_info= _bottleneck_cache_memory_bandwidth + tma_info_bottleneck_cache_memory_laten= cy + tma_info_bottleneck_memory_data_tlbs + tma_info_bottleneck_memory_sync= hronization + tma_info_bottleneck_compute_bound_est + tma_info_bottleneck_i= rregular_overhead + tma_info_bottleneck_branching_overhead + tma_info_bottl= eneck_useful_work)", - "MetricGroup": "BvOB;Cor;Offcore", - "MetricName": "tma_info_bottleneck_other_bottlenecks", - "MetricThreshold": "tma_info_bottleneck_other_bottlenecks > 20", - "PublicDescription": "Total pipeline cost of remaining bottlenecks= in the back-end. Examples include data-dependencies (Core Bound when Low I= LP) and other unlisted memory-related stalls." - }, - { - "BriefDescription": "Total pipeline cost of \"useful operations\" = - the portion of Retiring category not covered by Branching_Overhead nor Ir= regular_Overhead.", - "MetricExpr": "100 * (tma_retiring - (BR_INST_RETIRED.ALL_BRANCHES= + 2 * BR_INST_RETIRED.NEAR_CALL + INST_RETIRED.NOP) / tma_info_thread_slot= s - tma_microcode_sequencer / (tma_few_uops_instructions + tma_microcode_se= quencer) * (tma_assists / tma_microcode_sequencer) * tma_heavy_operations)", - "MetricGroup": "BvUW;Ret", - "MetricName": "tma_info_bottleneck_useful_work", - "MetricThreshold": "tma_info_bottleneck_useful_work > 20" + "MetricThreshold": "tma_info_botlnk_l2_ic_misses > 5" }, { "BriefDescription": "Fraction of branches that are CALL or RET", @@ -650,11 +701,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -667,20 +718,20 @@ "PublicDescription": "Fraction of Uops delivered by the DSB (aka D= ecoded ICache; or Uop Cache). 
Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_inst_mix_iptb, tma_lcp" }, { - "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= .", - "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D1\\,edge@", + "BriefDescription": "Average number of cycles of a switch from the= DSB fetch-unit to MITE fetch unit - see DSB_Switches tree node for details= ", + "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / cpu@DSB2MITE_SWI= TCHES.PENALTY_CYCLES\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "DSBmiss", "MetricName": "tma_info_frontend_dsb_switch_cost" }, { "BriefDescription": "Average number of Uops issued by front-end wh= en it issued something", - "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D1= @", + "MetricExpr": "UOPS_ISSUED.ANY / cpu@UOPS_ISSUED.ANY\\,cmask\\=3D0= x1@", "MetricGroup": "Fed;FetchBW", "MetricName": "tma_info_frontend_fetch_upc" }, { "BriefDescription": "Average Latency for L1 instruction cache miss= es", - "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D1\\,edge@", + "MetricExpr": "ICACHE_16B.IFDATA_STALL / cpu@ICACHE_16B.IFDATA_STA= LL\\,cmask\\=3D0x1\\,edge\\=3D0x1@", "MetricGroup": "Fed;FetchLat;IcMiss", "MetricName": "tma_info_frontend_icache_miss_latency" }, @@ -716,7 +767,13 @@ "MetricName": "tma_info_frontend_lsd_coverage" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -734,7 +791,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -742,7 +799,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). 
Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -750,7 +807,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX 512-bit in= struction (lower number means higher occurrence rate)", @@ -758,7 +815,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx512", "MetricThreshold": "tma_info_inst_mix_iparith_avx512 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX 512-bit i= nstruction (lower number means higher occurrence rate). Values < 1 are poss= ible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -766,7 +823,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -774,7 +831,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -819,7 +876,7 @@ }, { "BriefDescription": "Instructions per Software prefetch instructio= n (of any type: NTA/T0/T1/T2/Prefetch) (lower number means higher occurrenc= e rate)", - "MetricExpr": "INST_RETIRED.ANY / cpu@SW_PREFETCH_ACCESS.T0\\,umas= k\\=3D0xF@", + "MetricExpr": "INST_RETIRED.ANY / SW_PREFETCH_ACCESS.ANY", "MetricGroup": "Prefetches", "MetricName": "tma_info_inst_mix_ipswpf", "MetricThreshold": "tma_info_inst_mix_ipswpf < 100" @@ -829,7 +886,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 11", + "MetricThreshold": "tma_info_inst_mix_iptb < 5 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidth= , tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_lcp" }, { @@ -864,7 +921,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -882,7 +939,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -924,13 +981,13 @@ }, { "BriefDescription": "Average per-thread data access bandwidth to t= he L3 cache [GB / sec]", - "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / duration= _time", + "MetricExpr": "64 * OFFCORE_REQUESTS.ALL_REQUESTS / 1e9 / tma_info= _system_time", "MetricGroup": "Mem;MemoryBW;Offcore", "MetricName": "tma_info_memory_l3_cache_access_bw" }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -949,21 +1006,15 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { "BriefDescription": "Average Parallel L2 cache miss demand Loads", - "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D1@", + "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / cpu@O= FFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,cmask\\=3D0x1@", "MetricGroup": "Memory_BW;Offcore", "MetricName": "tma_info_memory_latency_load_l2_mlp" }, - { - "BriefDescription": "Average Latency for L3 cache miss demand Load= s", - "MetricExpr": "cpu@OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD\\,u= mask\\=3D0x10@ / OFFCORE_REQUESTS.L3_MISS_DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", - "MetricName": 
"tma_info_memory_latency_load_l3_miss_latency" - }, { "BriefDescription": "Actual Average Latency for L1 data-cache miss= demand load operations (in core cycles)", "MetricExpr": "L1D_PEND_MISS.PENDING / (MEM_LOAD_RETIRED.L1_MISS += MEM_LOAD_RETIRED.FB_HIT)", @@ -989,6 +1040,13 @@ "MetricName": "tma_info_memory_mlp", "PublicDescription": "Memory-Level-Parallelism (average number of = L1 miss demand load when there is at least one such miss. Per-Logical Proce= ssor)" }, + { + "BriefDescription": "Rate of L2 HW prefetched lines that were not = used by demand accesses", + "MetricExpr": "L2_LINES_OUT.USELESS_HWPF / (L2_LINES_OUT.SILENT + = L2_LINES_OUT.NON_SILENT)", + "MetricGroup": "Prefetches", + "MetricName": "tma_info_memory_prefetches_useless_hwpf", + "MetricThreshold": "tma_info_memory_prefetches_useless_hwpf > 0.15" + }, { "BriefDescription": "STLB (2nd level TLB) code speculative misses = per kilo instruction (misses of any page-size that complete the page walk)", "MetricExpr": "1e3 * ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", @@ -1015,8 +1073,8 @@ "MetricName": "tma_info_memory_tlb_store_stlb_mpki" }, { - "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D1@)", + "BriefDescription": "", + "MetricExpr": "UOPS_EXECUTED.THREAD / (UOPS_EXECUTED.CORE_CYCLES_G= E_1 / 2 if #SMT_on else cpu@UOPS_EXECUTED.THREAD\\,cmask\\=3D0x1@)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, @@ -1043,18 +1101,18 @@ "MetricExpr": "INST_RETIRED.ANY / ASSISTS.ANY", "MetricGroup": "MicroSeq;Pipeline;Ret;Retire", "MetricName": "tma_info_pipeline_ipassist", - "MetricThreshold": "tma_info_pipeline_ipassist < 100e3", + "MetricThreshold": "tma_info_pipeline_ipassist < 100000", "PublicDescription": "Instructions per a microcode Assist invocati= on. See Assists tree node for details (lower number means higher occurrence= rate)" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "tma_retiring * tma_info_thread_slots / cpu@UOPS_RET= IRED.SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -1072,14 +1130,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (arb@event\\=3D0x81\\,umask\\=3D0x1@ + arb@eve= nt\\=3D0x84\\,umask\\=3D0x1@) / 1e6 / duration_time / 1e3", + "MetricExpr": "64 * (UNC_ARB_TRK_REQUESTS.ALL + UNC_ARB_COH_TRK_RE= QUESTS.ALL) / 1e6 / tma_info_system_time / 1e3", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", - "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. 
Related metrics: tma_fb_full, tma_info_bottlenec= k_cache_memory_bandwidth, tma_mem_bandwidth, tma_sq_full" + "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_bottleneck_cache_memory_ban= dwidth, tma_fb_full, tma_mem_bandwidth, tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.8_FLOPS + 16 * FP_ARITH_INST_RETIRED.512B_PACKED_SINGLE) / 1= e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -1089,13 +1147,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k", "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -1118,12 +1177,25 @@ "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "power@energy\\-pkg@ * 61 / (tma_info_system_time * = 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for baseline license level 0", "MetricExpr": "CORE_POWER.LVL0_TURBO_LICENSE / tma_info_core_core_= clks", "MetricGroup": "Power", "MetricName": "tma_info_system_power_license0_utilization", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for baseline license level 0. 
This includes non= -AVX codes, SSE, AVX 128-bit, and low-current AVX 256-bit codes" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 1", @@ -1131,7 +1203,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license1_utilization", "MetricThreshold": "tma_info_system_power_license1_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 1. This includes high current= AVX 256-bit instructions as well as low current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of Core cycles where the core was ru= nning with power-delivery for license level 2 (introduced in SKX)", @@ -1139,7 +1211,7 @@ "MetricGroup": "Power", "MetricName": "tma_info_system_power_license2_utilization", "MetricThreshold": "tma_info_system_power_license2_utilization > 0= .5", - "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions." + "PublicDescription": "Fraction of Core cycles where the core was r= unning with power-delivery for license level 2 (introduced in SKX). This i= ncludes high current AVX 512-bit instructions" }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", @@ -1153,6 +1225,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -1160,7 +1239,7 @@ "MetricName": "tma_info_system_turbo_utilization" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -1169,14 +1248,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -1186,13 +1266,13 @@ }, { "BriefDescription": "Total issue-pipeline slots (per-Physical Core= till ICL; per-Logical Processor ICL onward)", - "MetricExpr": "TOPDOWN.SLOTS", + "MetricExpr": "slots", "MetricGroup": "TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots" }, { "BriefDescription": "Fraction of Physical Core issue-slots utilize= d by this Logical Processor", - "MetricExpr": "(tma_info_thread_slots / (TOPDOWN.SLOTS / 2) if #SM= T_on else 1)", + "MetricExpr": "(tma_info_thread_slots / (slots / 2) if #SMT_on els= e 1)", "MetricGroup": "SMT;TmaL1;tma_L1_group", "MetricName": "tma_info_thread_slots_utilization" }, @@ -1208,33 +1288,41 @@ "MetricExpr": "tma_retiring * tma_info_thread_slots / BR_INST_RETI= RED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 7.5" + "MetricThreshold": "tma_info_thread_uptb < 5 * 1.5" + }, + { + "BriefDescription": "This metric represents fraction of cycles whe= re the Integer Divider unit was active", + "MetricExpr": "tma_divider - tma_fp_divider", + "MetricGroup": "TopdownL4;tma_L4_group;tma_divider_group", + "MetricName": "tma_int_divider", + "MetricThreshold": "tma_int_divider > 0.2 & tma_divider > 0.2 & tm= a_core_bound > 0.1 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", "MetricExpr": "ICACHE_TAG.STALLS / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS_PS;FRONTEND_RETIRED.ITLB_MISS_PS", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: FRONTE= ND_RETIRED.STLB_MISS, FRONTEND_RETIRED.ITLB_MISS", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. 
Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT_PS;MEM_LOAD_RETIRED.FB= _HIT_PS. Related metrics: tma_clears_resteers, tma_machine_clears, tma_micr= ocode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_RETIRED.L1_HIT. Related metrics: t= ma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_ms_swi= tches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric roughly estimates fraction of cyc= les with demand load accesses that hit the L1 cache", + "BriefDescription": "This metric([SKL+] roughly; [LNL]) estimates = fraction of cycles with demand load accesses that hit the L1D cache", "MetricExpr": "min(2 * (MEM_INST_RETIRED.ALL_LOADS - MEM_LOAD_RETI= RED.FB_HIT - MEM_LOAD_RETIRED.L1_MISS) * 20 / 100, max(CYCLE_ACTIVITY.CYCLE= S_MEM_ANY - CYCLE_ACTIVITY.CYCLES_L1D_MISS, 0)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_l1_bound= _group", - "MetricName": "tma_l1_hit_latency", - "MetricThreshold": "tma_l1_hit_latency > 0.1 & (tma_l1_bound > 0.1= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles with demand load accesses that hit the L1 cache. The short latency of = the L1 data cache may be exposed in pointer-chasing memory access patterns = as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", + "MetricName": "tma_l1_latency_dependency", + "MetricThreshold": "tma_l1_latency_dependency > 0.1 & tma_l1_bound= > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric([SKL+] roughly; [LNL]) estimates= fraction of cycles with demand load accesses that hit the L1D cache. The s= hort latency of the L1D cache may be exposed in pointer-chasing memory acce= ss patterns as an example. Sample with: MEM_LOAD_RETIRED.L1_HIT", "ScaleUnit": "100%" }, { @@ -1243,8 +1331,17 @@ "MetricExpr": "MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_= HIT / MEM_LOAD_RETIRED.L1_MISS) / (MEM_LOAD_RETIRED.L2_HIT * (1 + MEM_LOAD_= RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS) + L1D_PEND_MISS.FB_FULL_PERIODS)= * ((CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.STALLS_L2_MISS) / tma_= info_thread_clks)", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. 
Sample wi= th: MEM_LOAD_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_RETIRED.L2_HIT", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents fraction of cycles wit= h demand load accesses that hit the L2 cache under unloaded scenarios (poss= ibly L2 latency limited)", + "MetricExpr": "5 * tma_info_system_core_frequency * MEM_LOAD_RETIR= ED.L2_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2) / = tma_info_thread_clks", + "MetricGroup": "MemoryLat;TopdownL4;tma_L4_group;tma_l2_bound_grou= p", + "MetricName": "tma_l2_hit_latency", + "MetricThreshold": "tma_l2_hit_latency > 0.05 & tma_l2_bound > 0.0= 5 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wi= th demand load accesses that hit the L2 cache under unloaded scenarios (pos= sibly L2 latency limited). Avoiding L1 cache misses (i.e. L1 misses/L2 hit= s) will improve the latency. Sample with: MEM_LOAD_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1253,17 +1350,17 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L2_MISS - CYCLE_ACTIVITY.STA= LLS_L3_MISS) / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles with= demand load accesses that hit the L3 cache under unloaded scenarios (possi= bly L3 latency limited)", - "MetricExpr": "17.5 * tma_info_system_core_frequency * (MEM_LOAD_R= ETIRED.L3_HIT * (1 + MEM_LOAD_RETIRED.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2= )) / tma_info_thread_clks", + "MetricExpr": "(22.5 * tma_info_system_core_frequency - 5 * tma_in= fo_system_core_frequency) * (MEM_LOAD_RETIRED.L3_HIT * (1 + MEM_LOAD_RETIRE= D.FB_HIT / MEM_LOAD_RETIRED.L1_MISS / 2)) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. 
Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT_PS. Related metrics: tm= a_info_bottleneck_cache_memory_latency, tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_RETIRED.L3_HIT. Related metrics: tma_b= ottleneck_cache_memory_latency, tma_branch_resteers, tma_mem_latency, tma_s= tore_latency", "ScaleUnit": "100%" }, { @@ -1271,18 +1368,18 @@ "MetricExpr": "DECODE.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_botlnk_l2_dsb_bandwidth, tma_info_botlnk_l2_dsb_misses,= tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_botlnk_l2_dsb_bandwidt= h, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_= inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). 
Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1299,7 +1396,7 @@ "MetricExpr": "tma_dtlb_load - tma_load_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_hit", - "MetricThreshold": "tma_load_stlb_hit > 0.05 & (tma_dtlb_load > 0.= 1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", + "MetricThreshold": "tma_load_stlb_hit > 0.05 & tma_dtlb_load > 0.1= & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1307,16 +1404,40 @@ "MetricExpr": "DTLB_LOAD_MISSES.WALK_ACTIVE / tma_info_thread_clks= ", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_load_gro= up", "MetricName": "tma_load_stlb_miss", - "MetricThreshold": "tma_load_stlb_miss > 0.05 & (tma_dtlb_load > 0= .1 & (tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.= 2)))", + "MetricThreshold": "tma_load_stlb_miss > 0.05 & tma_dtlb_load > 0.= 1 & tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_1G / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_1g", + "MetricThreshold": "tma_load_stlb_miss_1g > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPL= ETED_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_2m", + "MetricThreshold": "tma_load_stlb_miss_2m > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data load accesses", + "MetricExpr": "tma_load_stlb_miss * DTLB_LOAD_MISSES.WALK_COMPLETE= D_4K / (DTLB_LOAD_MISSES.WALK_COMPLETED_4K + DTLB_LOAD_MISSES.WALK_COMPLETE= D_2M_4M + DTLB_LOAD_MISSES.WALK_COMPLETED_1G)", + 
"MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_load_stlb_mis= s_group", + "MetricName": "tma_load_stlb_miss_4k", + "MetricThreshold": "tma_load_stlb_miss_4k > 0.05 & tma_load_stlb_m= iss > 0.05 & tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma_memory_bound > = 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(16 * max(0, MEM_INST_RETIRED.LOCK_LOADS - L2_RQSTS= .ALL_RFO) + MEM_INST_RETIRED.LOCK_LOADS / MEM_INST_RETIRED.ALL_STORES * (10= * L2_RQSTS.RFO_HIT + min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTAN= DING.CYCLES_WITH_DEMAND_RFO))) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_INST_RETIRED.LOCK_LOA= DS. Related metrics: tma_store_latency", "ScaleUnit": "100%" }, @@ -1326,7 +1447,7 @@ "MetricGroup": "FetchBW;LSD;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_lsd", "MetricThreshold": "tma_lsd > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to LSD (Loop Stream Detector) unit. = LSD typically does well sustaining Uop supply. However; in some rare cases= ; optimal uop-delivery could not be reached for small loops whose size (in = terms of number of uops) does not suit well the LSD structure", "ScaleUnit": "100%" }, { @@ -1336,16 +1457,16 @@ "MetricName": "tma_machine_clears", "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation= > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sh= aring, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_c= ache", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Machine Clears. 
These slots are either wasted by uo= ps fetched prior to the clear; or stalls the out-of-order portion of the ma= chine needs to recover its state after the clear. For example; this can hap= pen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modif= ying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: = tma_bottleneck_memory_synchronization, tma_clears_resteers, tma_contested_a= ccesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_s= equencer, tma_ms_switches", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_bottleneck_cache= _memory_bandwidth, tma_info_system_dram_bw_use, tma_sq_full", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). 
Related metrics: tma_bottleneck_cache_memory_bandwidth,= tma_fb_full, tma_info_system_dram_bw_use, tma_sq_full", "ScaleUnit": "100%" }, { @@ -1353,8 +1474,8 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_info_bottleneck_cache_memory_latency, tma_l3_hit_la= tency", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_bottleneck_cache_memory_latency, tma_l3_hit_latency= ", "ScaleUnit": "100%" }, { @@ -1364,11 +1485,11 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. 
This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations -- uops for memory load or store a= ccesses.", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring memory operations , uops for memory load or store ac= cesses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "tma_light_operations * MEM_INST_RETIRED.ANY / INST_= RETIRED.ANY", "MetricGroup": "Pipeline;TopdownL3;tma_L3_group;tma_light_operatio= ns_group", @@ -1382,7 +1503,7 @@ "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_heavy_operatio= ns_group;tma_issueMC;tma_issueMS", "MetricName": "tma_microcode_sequencer", "MetricThreshold": "tma_microcode_sequencer > 0.05 & tma_heavy_ope= rations > 0.1", - "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_clears_resteers,= tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_clears, = tma_ms_switches", + "PublicDescription": "This metric represents fraction of slots the= CPU was retiring uops fetched by the Microcode Sequencer (MS) unit. The M= S is used for CISC instructions not supported by the default decoders (like= repeat move strings; or CPUID); or by microcode assists used to address so= me operation modes (like in Floating Point assists). These cases can often = be avoided. Sample with: IDQ.MS_UOPS. Related metrics: tma_bottleneck_irreg= ular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clears, tma_m= s_switches", "ScaleUnit": "100%" }, { @@ -1390,8 +1511,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES / (BR_MISP_RETIRED.ALL= _BRANCHES + MACHINE_CLEARS.COUNT) * INT_MISC.CLEAR_RESTEER_CYCLES / tma_inf= o_thread_clks", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. Related m= etrics: tma_branch_mispredicts, tma_info_bad_spec_branch_misprediction_cost= , tma_info_bottleneck_mispredictions", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Sample with: INT_MISC.CLEAR_RESTEER_CYCLES. 
Related m= etrics: tma_bottleneck_mispredictions, tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", "ScaleUnit": "100%" }, { @@ -1405,19 +1526,27 @@ }, { "BriefDescription": "This metric represents fraction of cycles whe= re (only) 4 uops were delivered by the MITE pipeline", - "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D4@ - cpu@IDQ.MITE_UO= PS\\,cmask\\=3D5@) / tma_info_thread_clks", + "MetricExpr": "(cpu@IDQ.MITE_UOPS\\,cmask\\=3D0x4@ - cpu@IDQ.MITE_= UOPS\\,cmask\\=3D0x5@) / tma_info_thread_clks", "MetricGroup": "DSBmiss;FetchBW;TopdownL4;tma_L4_group;tma_mite_gr= oup", "MetricName": "tma_mite_4wide", - "MetricThreshold": "tma_mite_4wide > 0.05 & (tma_mite > 0.1 & tma_= fetch_bandwidth > 0.2)", + "MetricThreshold": "tma_mite_4wide > 0.05 & tma_mite > 0.1 & tma_f= etch_bandwidth > 0.2", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued -- the Count D= omain; [ADL+] cycles)", + "BriefDescription": "This metric estimates penalty in terms of per= centage of([SKL+] injected blend uops out of all Uops Issued , the Count Do= main; [ADL+] cycles)", "MetricExpr": "UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY= ", "MetricGroup": "TopdownL5;tma_L5_group;tma_issueMV;tma_ports_utili= zed_0_group", "MetricName": "tma_mixing_vectors", "MetricThreshold": "tma_mixing_vectors > 0.05", - "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued -- the Count = Domain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investiga= ting. Read more in Appendix B1 of the Optimizations Guide for this topic. R= elated metrics: tma_ms_switches", + "PublicDescription": "This metric estimates penalty in terms of pe= rcentage of([SKL+] injected blend uops out of all Uops Issued , the Count D= omain; [ADL+] cycles). Usually a Mixing_Vectors over 5% is worth investigat= ing. Read more in Appendix B1 of the Optimizations Guide for this topic. Re= lated metrics: tma_ms_switches", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric represents Core fraction of cycle= s in which CPU was likely limited due to the Microcode Sequencer (MS) unit = - see Microcode_Sequencer node for details", + "MetricExpr": "cpu@IDQ.MS_UOPS\\,cmask\\=3D0x1@ / tma_info_core_co= re_clks / 2", + "MetricGroup": "MicroSeq;TopdownL3;tma_L3_group;tma_fetch_bandwidt= h_group", + "MetricName": "tma_ms", + "MetricThreshold": "tma_ms > 0.05 & tma_fetch_bandwidth > 0.2", "ScaleUnit": "100%" }, { @@ -1425,8 +1554,8 @@ "MetricExpr": "3 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. 
The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_info_bottleneck_irregular_overhead, tma_l1_bound, tma_machine_= clears, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operat= ion", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_bottlene= ck_irregular_overhead, tma_clears_resteers, tma_l1_bound, tma_machine_clear= s, tma_microcode_sequencer, tma_mixing_vectors, tma_serializing_operation", "ScaleUnit": "100%" }, { @@ -1434,7 +1563,7 @@ "MetricExpr": "tma_light_operations * INST_RETIRED.NOP / (tma_reti= ring * tma_info_thread_slots)", "MetricGroup": "BvBO;Pipeline;TopdownL4;tma_L4_group;tma_other_lig= ht_ops_group", "MetricName": "tma_nop_instructions", - "MetricThreshold": "tma_nop_instructions > 0.1 & (tma_other_light_= ops > 0.3 & tma_light_operations > 0.6)", + "MetricThreshold": "tma_nop_instructions > 0.1 & tma_other_light_o= ps > 0.3 & tma_light_operations > 0.6", "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring NOP (no op) instructions. Compilers often use NOPs = for certain address alignments - e.g. start address of a function or loop b= ody. 
Sample with: INST_RETIRED.NOP", "ScaleUnit": "100%" }, @@ -1449,19 +1578,19 @@ "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types).", + "BriefDescription": "This metric estimates fraction of slots the C= PU was stalled due to other cases of misprediction (non-retired x86 branche= s or other types)", "MetricExpr": "max(tma_branch_mispredicts * (1 - BR_MISP_RETIRED.A= LL_BRANCHES / (INT_MISC.CLEARS_COUNT - MACHINE_CLEARS.COUNT)), 0.0001)", "MetricGroup": "BrMispredicts;BvIO;TopdownL3;tma_L3_group;tma_bran= ch_mispredicts_group", "MetricName": "tma_other_mispredicts", - "MetricThreshold": "tma_other_mispredicts > 0.05 & (tma_branch_mis= predicts > 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_mispredicts > 0.05 & tma_branch_misp= redicts > 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= .", + "BriefDescription": "This metric represents fraction of slots the = CPU has wasted due to Nukes (Machine Clears) not related to memory ordering= ", "MetricExpr": "max(tma_machine_clears * (1 - MACHINE_CLEARS.MEMORY= _ORDERING / MACHINE_CLEARS.COUNT), 0.0001)", "MetricGroup": "BvIO;Machine_Clears;TopdownL3;tma_L3_group;tma_mac= hine_clears_group", "MetricName": "tma_other_nukes", - "MetricThreshold": "tma_other_nukes > 0.05 & (tma_machine_clears >= 0.1 & tma_bad_speculation > 0.15)", + "MetricThreshold": "tma_other_nukes > 0.05 & tma_machine_clears > = 0.1 & tma_bad_speculation > 0.15", "ScaleUnit": "100%" }, { @@ -1497,7 +1626,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_6. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED.PORT_1. Related metrics: tma_fp_scalar= , tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b= , tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1505,8 +1634,8 @@ "MetricExpr": "((tma_ports_utilized_0 * tma_info_thread_clks + (EX= E_ACTIVITY.1_PORTS_UTIL + tma_retiring * EXE_ACTIVITY.2_PORTS_UTIL)) / tma_= info_thread_clks if ARITH.DIVIDER_ACTIVE < CYCLE_ACTIVITY.STALLS_TOTAL - CY= CLE_ACTIVITY.STALLS_MEM_ANY else (EXE_ACTIVITY.1_PORTS_UTIL + tma_retiring = * EXE_ACTIVITY.2_PORTS_UTIL) / tma_info_thread_clks)", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). 
Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { @@ -1514,8 +1643,8 @@ "MetricExpr": "EXE_ACTIVITY.EXE_BOUND_0_PORTS / tma_info_thread_cl= ks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { @@ -1523,7 +1652,7 @@ "MetricExpr": "EXE_ACTIVITY.1_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. Sample with: EXE_ACTIVITY.1_PORTS_UTIL. 
Related m= etrics: tma_l1_bound", "ScaleUnit": "100%" }, @@ -1532,7 +1661,7 @@ "MetricExpr": "EXE_ACTIVITY.2_PORTS_UTIL / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. S= ample with: EXE_ACTIVITY.2_PORTS_UTIL. Related metrics: tma_fp_scalar, tma_= fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_= port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, @@ -1541,14 +1670,14 @@ "MetricExpr": "UOPS_EXECUTED.CYCLES_GE_3 / tma_info_thread_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 3 or more uops per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise). Sample with:= UOPS_EXECUTED.CYCLES_GE_3", "ScaleUnit": "100%" }, { "BriefDescription": "This category represents fraction of slots ut= ilized by useful work i.e. issued uops that eventually get retired", "DefaultMetricgroupName": "TopdownL1", - "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_= thread_slots", + "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdow= n\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * slots", "MetricGroup": "BvUW;Default;TmaL1;TopdownL1;tma_L1_group", "MetricName": "tma_retiring", "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.= 1", @@ -1561,7 +1690,7 @@ "MetricExpr": "RESOURCE_STALLS.SCOREBOARD / tma_info_thread_clks", "MetricGroup": "BvIO;PortsUtil;TopdownL3;tma_L3_group;tma_core_bou= nd_group;tma_issueSO", "MetricName": "tma_serializing_operation", - "MetricThreshold": "tma_serializing_operation > 0.1 & (tma_core_bo= und > 0.1 & tma_backend_bound > 0.2)", + "MetricThreshold": "tma_serializing_operation > 0.1 & tma_core_bou= nd > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU issue-pipeline was stalled due to serializing operations. Instruction= s like CPUID; WRMSR or LFENCE serialize the out-of-order execution which ma= y limit performance. Sample with: RESOURCE_STALLS.SCOREBOARD. 
Related metri= cs: tma_ms_switches", "ScaleUnit": "100%" }, @@ -1570,7 +1699,7 @@ "MetricExpr": "140 * MISC_RETIRED.PAUSE_INST / tma_info_thread_clk= s", "MetricGroup": "TopdownL4;tma_L4_group;tma_serializing_operation_g= roup", "MetricName": "tma_slow_pause", - "MetricThreshold": "tma_slow_pause > 0.05 & (tma_serializing_opera= tion > 0.1 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_slow_pause > 0.05 & tma_serializing_operat= ion > 0.1 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to PAUSE Instructions. Sample with: MISC_RETIRED.PAUS= E_INST", "ScaleUnit": "100%" }, @@ -1579,8 +1708,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_INST_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1589,17 +1718,17 @@ "MetricExpr": "MEM_INST_RETIRED.SPLIT_STORES / tma_info_core_core_= clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_INST_RETIRED.SPLIT_STORES", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "L1D_PEND_MISS.L2_STALL / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). 
Related metrics: tma_fb_full= , tma_info_bottleneck_cache_memory_bandwidth, tma_info_system_dram_bw_use, = tma_mem_bandwidth", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_bottlen= eck_cache_memory_bandwidth, tma_fb_full, tma_info_system_dram_bw_use, tma_m= em_bandwidth", "ScaleUnit": "100%" }, { @@ -1607,8 +1736,8 @@ "MetricExpr": "EXE_ACTIVITY.BOUND_ON_STORES / tma_info_thread_clks= ", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_INST_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1617,17 +1746,17 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricExpr": "(L2_RQSTS.RFO_HIT * 10 * (1 - MEM_INST_RETIRED.LOCK= _LOADS / MEM_INST_RETIRED.ALL_STORES) + (1 - MEM_INST_RETIRED.LOCK_LOADS / = MEM_INST_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUEST= S_OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1644,7 +1773,7 @@ "MetricExpr": "tma_dtlb_store - tma_store_stlb_miss", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_hit", - "MetricThreshold": "tma_store_stlb_hit > 0.05 & (tma_dtlb_store > = 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound= > 0.2)))", + "MetricThreshold": "tma_store_stlb_hit > 0.05 & tma_dtlb_store > 0= .05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > = 0.2", "ScaleUnit": "100%" }, { @@ -1652,7 +1781,31 @@ "MetricExpr": "DTLB_STORE_MISSES.WALK_ACTIVE / tma_info_core_core_= clks", "MetricGroup": "MemoryTLB;TopdownL5;tma_L5_group;tma_dtlb_store_gr= oup", "MetricName": "tma_store_stlb_miss", - "MetricThreshold": "tma_store_stlb_miss > 0.05 & (tma_dtlb_store >= 0.05 & (tma_store_bound > 0.2 & (tma_memory_bound > 0.2 & tma_backend_boun= d > 0.2)))", + "MetricThreshold": "tma_store_stlb_miss > 0.05 & tma_dtlb_store > = 0.05 & tma_store_bound > 0.2 & tma_memory_bound > 0.2 & tma_backend_bound >= 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 1 GB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_1G / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_1g", + "MetricThreshold": "tma_store_stlb_miss_1g > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= 
ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 2 or 4 MB page= s for data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_2M_4M / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_C= OMPLETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_2m", + "MetricThreshold": "tma_store_stlb_miss_2m > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", + "ScaleUnit": "100%" + }, + { + "BriefDescription": "This metric estimates the fraction of cycles = to walk the memory paging structures to cache translation of 4 KB pages for= data store accesses", + "MetricExpr": "tma_store_stlb_miss * DTLB_STORE_MISSES.WALK_COMPLE= TED_4K / (DTLB_STORE_MISSES.WALK_COMPLETED_4K + DTLB_STORE_MISSES.WALK_COMP= LETED_2M_4M + DTLB_STORE_MISSES.WALK_COMPLETED_1G)", + "MetricGroup": "MemoryTLB;TopdownL6;tma_L6_group;tma_store_stlb_mi= ss_group", + "MetricName": "tma_store_stlb_miss_4k", + "MetricThreshold": "tma_store_stlb_miss_4k > 0.05 & tma_store_stlb= _miss > 0.05 & tma_dtlb_store > 0.05 & tma_store_bound > 0.2 & tma_memory_b= ound > 0.2 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1660,7 +1813,7 @@ "MetricExpr": "9 * OCR.STREAMING_WR.ANY_RESPONSE / tma_info_thread= _clks", "MetricGroup": "MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_issueS= mSt;tma_store_bound_group", "MetricName": "tma_streaming_stores", - "MetricThreshold": "tma_streaming_stores > 0.2 & (tma_store_bound = > 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_streaming_stores > 0.2 & tma_store_bound >= 0.2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates how often CPU was stal= led due to Streaming store memory accesses; Streaming store optimize out a= read request required by RFO stores. Even though store accesses do not typ= ically stall out-of-order CPUs; there are few cases where stores can lead t= o actual stalls. This metric will be flagged should Streaming stores be a b= ottleneck. Sample with: OCR.STREAMING_WR.ANY_RESPONSE. Related metrics: tma= _fb_full", "ScaleUnit": "100%" }, @@ -1669,7 +1822,7 @@ "MetricExpr": "10 * BACLEARS.ANY / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1678,8 +1831,8 @@ "MetricExpr": "tma_retiring * UOPS_EXECUTED.X87 / UOPS_EXECUTED.TH= READ", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group", "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.j= son b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json index 1500bf109c99..3732b6a68fe8 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-interconnect.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_ARB_COH_TRK_REQUESTS.ALL", + "BriefDescription": "Number of entries allocated. Account for Any = type: e.g. Snoop, etc.", "Counter": "0,1", "EventCode": "0x84", "EventName": "UNC_ARB_COH_TRK_REQUESTS.ALL", @@ -90,7 +90,7 @@ "Unit": "ARB" }, { - "BriefDescription": "UNC_ARB_TRK_REQUESTS.ALL", + "BriefDescription": "Total number of all outgoing entries allocate= d. Accounts for Coherent and non-coherent traffic.", "Counter": "0,1", "EventCode": "0x81", "EventName": "UNC_ARB_TRK_REQUESTS.ALL", diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json b/t= ools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json index cc8110ac020c..1ac5b5ef8094 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/uncore-other.json @@ -1,6 +1,6 @@ [ { - "BriefDescription": "UNC_CLOCK.SOCKET", + "BriefDescription": "This 48-bit fixed counter counts the UCLK cyc= les.", "Counter": "FIXED", "EventCode": "0xff", "EventName": "UNC_CLOCK.SOCKET", diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json b= /tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json index 62dc0fc76748..76eeca979e0b 100644 --- a/tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json +++ b/tools/perf/pmu-events/arch/x86/tigerlake/virtual-memory.json @@ -27,6 +27,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data loa= d to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x08", + "EventName": "DTLB_LOAD_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts completed page walks (1G sizes) caus= ed by demand data loads. This implies address translations missed in the DT= LB and further levels of TLB. 
The page walk can end with or without a fault= .", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data loa= d to a 2M/4M page.", "Counter": "0,1,2,3", @@ -82,6 +91,15 @@ "SampleAfterValue": "100003", "UMask": "0xe" }, + { + "BriefDescription": "Page walks completed due to a demand data sto= re to a 1G page.", + "Counter": "0,1,2,3", + "EventCode": "0x49", + "EventName": "DTLB_STORE_MISSES.WALK_COMPLETED_1G", + "PublicDescription": "Counts page walks completed due to demand da= ta stores whose address translations missed in the TLB and were mapped to 1= G pages. The page walks can end with or without a page fault.", + "SampleAfterValue": "100003", + "UMask": "0x8" + }, { "BriefDescription": "Page walks completed due to a demand data sto= re to a 2M/4M page.", "Counter": "0,1,2,3", --=20 2.48.0.rc2.279.g1de40edade-goog
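A side note for reviewers on the new TopdownL6 TLB children above: tma_store_stlb_miss_{4k,2m,1g} simply apportion the parent tma_store_stlb_miss by each page size's share of completed page walks, so the three children sum back to the parent. A rough Python sketch of that arithmetic, with made-up counter values and an assumed parent fraction (perf evaluates the MetricExpr strings directly; this is only a hand-computed equivalent):

    # Hypothetical counts for the DTLB_STORE_MISSES.WALK_COMPLETED_* events.
    walks = {
        "4k": 9000,  # DTLB_STORE_MISSES.WALK_COMPLETED_4K
        "2m": 900,   # DTLB_STORE_MISSES.WALK_COMPLETED_2M_4M
        "1g": 100,   # DTLB_STORE_MISSES.WALK_COMPLETED_1G
    }
    tma_store_stlb_miss = 0.08  # assumed parent value (fraction of core clocks)

    total = sum(walks.values())
    for size, count in walks.items():
        # Child = parent * (completed walks of this size / all completed walks),
        # mirroring the MetricExpr of tma_store_stlb_miss_{4k,2m,1g}.
        print(f"tma_store_stlb_miss_{size}: {tma_store_stlb_miss * count / total:.4f}")

With the numbers above this prints 0.0720, 0.0072 and 0.0008, which sum to the parent's 0.08 as expected.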